Case Studies

We publish results for optimizations that work and optimizations that don't. Every skill goes through the same blind evaluation. Here's what real results look like, including when we fail.

Full methodology and detailed reports: github.com/willynikes2/skill-evals

You only pay for results

If your skill doesn't improve by at least 10pp or win 60%+ of blind evaluations, you get a full refund. No questions, no friction.
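The guarantee above reduces to a simple check. A minimal sketch (the function name and inputs are illustrative, not our actual billing code):

```python
def qualifies_for_billing(score_before: float, score_after: float,
                          wins: int, total_evals: int) -> bool:
    """Bill only if the skill improved by at least 10 percentage points
    OR won 60%+ of blind evaluations. Otherwise: full refund."""
    improvement_pp = (score_after - score_before) * 100  # percentage points
    win_rate = wins / total_evals
    return improvement_pp >= 10 or win_rate >= 0.60

# Brainstorming: 80% -> 96% (+16pp), 10 wins in 10 evaluations
qualifies_for_billing(0.80, 0.96, 10, 10)  # True, on both criteria
```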

Cryptographically verified reports

Every report card is SHA-256 hashed at generation time. The hash covers all judge votes, scores, and test inputs. We can't alter results after the fact — and you can prove it.

Validated

Brainstorming

A clear win. 10 out of 10.

The brainstorming skill turns ideas into structured designs before code gets written. The optimized version won every single blind evaluation — across simple features, complex systems, refactors, and greenfield projects. It's also 70% smaller.

Before: 80% → After: 96%
Record: 10W 0L 0T
Size: 10.5KB original → 3.1KB optimized

Why this optimization succeeded

  • Binary pass/fail criteria prevented reward hacking. The optimizer couldn't game vague quality scores — each criterion either passed or it didn't.
  • No judge ever preferred the original. Claude and OpenAI both chose the optimized version on every test case, and Gemini scored each one a tie. Convergence across three different models is hard to dismiss as one model's preferences.
  • The optimized version is 70% smaller (10.5KB → 3.1KB). The optimizer removed redundant steps and verbose instructions, evidence that concise prompts can outperform bloated ones.
  • 10 diverse test cases covered the full range: simple features, complex systems, refactors, greenfield projects, API design, and security reviews. No cherry-picking.
  • This is what proper fitness function design looks like. The same pipeline that caught writing-plans gaming the judges confirmed brainstorming as a genuine improvement.
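The binary pass/fail criteria in the first point can be sketched as a simple fitness function. This is a sketch only; the criterion names are hypothetical examples, not the actual rubric:

```python
def fitness(criteria_results: dict[str, bool]) -> float:
    """Binary pass/fail fitness: each criterion either passed or it didn't.
    No partial credit means the optimizer can't inflate vague quality scores."""
    return sum(criteria_results.values()) / len(criteria_results)

# Hypothetical criteria for illustration
results = {
    "states_problem_before_solution": True,
    "lists_alternatives_considered": True,
    "identifies_open_questions": False,
}
fitness(results)  # 2 of 3 criteria passed -> ~0.667
```

Because each criterion is a hard yes/no, the only way to raise the score is to actually satisfy more criteria.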

Blind Evaluation Results

Three independent AI judges evaluate both versions without knowing which is optimized. Majority rules.

| Test Case              | Claude | OpenAI | Gemini | Verdict   |
|------------------------|--------|--------|--------|-----------|
| T01 Simple Feature     | OPT    | OPT    | tie    | OPTIMIZED |
| T02 Complex Feature    | OPT    | OPT    | tie    | OPTIMIZED |
| T03 Refactor           | OPT    | OPT    | tie    | OPTIMIZED |
| T04 Greenfield Project | OPT    | OPT    | tie    | OPTIMIZED |
| T05 API Design         | OPT    | OPT    | tie    | OPTIMIZED |
| T06 UI Component       | OPT    | OPT    | tie    | OPTIMIZED |
| T07 Data Pipeline      | OPT    | OPT    | tie    | OPTIMIZED |
| T08 Integration        | OPT    | OPT    | tie    | OPTIMIZED |
| T09 Performance        | OPT    | OPT    | tie    | OPTIMIZED |
| T10 Security           | OPT    | OPT    | tie    | OPTIMIZED |
SHA-256: a7c3f8e1d9b2...4f6a
Mar 27, 2026, 02:32 PM
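The majority-rules verdict can be sketched as follows. How "tie" votes are handled is an assumption inferred from the table (they abstain, and the remaining votes decide):

```python
from collections import Counter

def verdict(votes: list[str]) -> str:
    """Majority-rules verdict from the three blind judges.
    Assumed tie handling: 'tie' votes abstain, and whichever side
    gets more of the remaining votes wins; an even split is a tie."""
    counts = Counter(v for v in votes if v != "tie")
    if counts["OPT"] > counts["ORIG"]:
        return "OPTIMIZED"
    if counts["ORIG"] > counts["OPT"]:
        return "ORIGINAL"
    return "TIE"

verdict(["OPT", "OPT", "tie"])  # -> "OPTIMIZED", as in every row above
```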

Takeaways

  01. Smaller isn't always worse: the optimized skill removed redundant steps and got better results in fewer tokens.
  02. Claude and OpenAI picked the optimized version on all ten test cases, and Gemini scored every case a tie. No judge ever preferred the original; when judges converge like this, the signal is strong.
  03. The original skill had a 10-step process. The optimized version does the same job in 7 steps.
  04. This skill is from the Superpowers plugin by obra. We took the stock version, optimized it, and the results speak for themselves. That's the Presient model: find great skills, make them measurably better.

How report verification works

01

Hash at generation

When your report card is generated, we compute a SHA-256 hash of the complete evaluation data: every judge vote, every score, every test input and output.
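In sketch form, hashing the evaluation data looks like this. The canonical-JSON serialization and the field names are assumptions for illustration, not the pipeline's exact format:

```python
import hashlib
import json

def report_hash(evaluation_data: dict) -> str:
    """Hash the complete evaluation record. Serializing with sorted keys
    makes the hash deterministic: the same data always yields the same hash,
    and changing any field yields a completely different one."""
    canonical = json.dumps(evaluation_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical report structure for illustration
report = {
    "skill": "brainstorming",
    "votes": {"T01": ["OPT", "OPT", "tie"]},
    "scores": {"before": 0.80, "after": 0.96},
}
report_hash(report)  # 64-hex-character SHA-256 digest
```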

02

Immutable record

The hash is embedded in your report and stored independently. If anyone changes even a single character in the evaluation data, the hash won't match.

03

You can verify

Your report includes the raw data and the hash. Run it through SHA-256 yourself. If the hashes match, the report is exactly what our pipeline produced — unaltered.

# Example: verify a report
$ sha256sum report-brainstorming-raw.json
a7c3f8e1d9b2...4f6a report-brainstorming-raw.json
# ✓ Matches the hash on your report card