Case Studies
We publish results for optimizations that work and optimizations that don't. Every skill goes through the same blind evaluation. Here's what real results look like, including when we fail.
Full methodology and detailed reports: github.com/willynikes2/skill-evals
You only pay for results
If your skill doesn't improve by at least 10 percentage points or win 60%+ of blind evaluations, you get a full refund. No questions, no friction.
Cryptographically verified reports
Every report card is SHA-256 hashed at generation time. The hash covers all judge votes, scores, and test inputs. We can't alter results after the fact — and you can prove it.
Brainstorming
A clear win. 10 out of 10.
The brainstorming skill turns ideas into structured designs before code gets written. The optimized version won every single blind evaluation — across simple features, complex systems, refactors, and greenfield projects. It's also 70% smaller.
Why this optimization succeeded
- Binary pass/fail criteria prevented reward hacking. The optimizer couldn't game vague quality scores; each criterion either passed or it didn't (see the sketch after this list).
- 3 independent blind judges returned consistent verdicts on every test case: Claude and OpenAI picked the optimized version each time, and Gemini never once preferred the original. When three different models point the same way, the improvement is real, not an artifact of one model's preferences.
- The optimized version is 70% smaller (10.5KB → 3.1KB). The optimizer removed redundant steps and verbose instructions, strong evidence that concise prompts can outperform bloated ones.
- 10 diverse test cases covered the full range: simple features, complex systems, refactors, greenfield projects, API design, and security reviews. No cherry-picking.
- This is what proper fitness function design looks like. The same pipeline that caught writing-plans gaming the judges confirmed brainstorming as a genuine improvement.
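For illustration, here's a minimal sketch of what binary pass/fail scoring looks like. The three criteria are hypothetical stand-ins, not the actual brainstorming rubric:

```python
def score(output: str) -> int:
    # Every criterion is strictly pass/fail: no partial credit for an
    # optimizer to inflate. These three checks are illustrative only.
    criteria = [
        "## Requirements" in output,    # produced a requirements section
        "## Open Questions" in output,  # surfaced open questions
        len(output.split()) < 1200,     # stayed within a length budget
    ]
    return sum(criteria)  # count of criteria passed, 0-3
```

Because each check is a hard boolean, the only way to raise the score is to actually satisfy more criteria.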
Blind Evaluation Results
Three independent AI judges evaluate both versions without knowing which is optimized. Majority rules.
| Test Case | Claude | OpenAI | Gemini | Verdict |
|---|---|---|---|---|
| T01 Simple Feature | OPT | OPT | tie | OPTIMIZED |
| T02 Complex Feature | OPT | OPT | tie | OPTIMIZED |
| T03 Refactor | OPT | OPT | tie | OPTIMIZED |
| T04 Greenfield Project | OPT | OPT | tie | OPTIMIZED |
| T05 API Design | OPT | OPT | tie | OPTIMIZED |
| T06 UI Component | OPT | OPT | tie | OPTIMIZED |
| T07 Data Pipeline | OPT | OPT | tie | OPTIMIZED |
| T08 Integration | OPT | OPT | tie | OPTIMIZED |
| T09 Performance | OPT | OPT | tie | OPTIMIZED |
| T10 Security | OPT | OPT | tie | OPTIMIZED |
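To make "majority rules" concrete, here's a minimal sketch of the verdict computation. The vote encoding is hypothetical, not our actual evaluation code:

```python
from collections import Counter

def verdict(votes: list[str]) -> str:
    """Majority verdict over judge votes: 'OPT', 'ORIG', or 'tie'."""
    counts = Counter(votes)
    if counts["OPT"] > len(votes) // 2:
        return "OPTIMIZED"
    if counts["ORIG"] > len(votes) // 2:
        return "ORIGINAL"
    return "TIE"

# T01 above: Claude and OpenAI pick the optimized version, Gemini scores
# a tie. Two of three is a majority, so the verdict is OPTIMIZED.
print(verdict(["OPT", "OPT", "tie"]))  # -> OPTIMIZED
```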
Takeaways
1. Smaller isn't always worse: the optimized skill removed redundant steps and got better results in fewer tokens.
2. Claude and OpenAI sided with the optimized version on every test case, and Gemini never sided with the original. When judges converge like that, the signal is strong.
3. The original skill had a 10-step process. The optimized version does the same job in 7 steps.
4. This skill is from the Superpowers plugin by obra. We took the stock version, optimized it, and the results speak for themselves. That's the Presient model: find great skills, make them measurably better.
How report verification works
Hash at generation
When your report card is generated, we compute a SHA-256 hash of the complete evaluation data: every judge vote, every score, every test input and output.
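A minimal sketch of the idea, assuming the evaluation record is a JSON-serializable dict. The canonicalization shown is illustrative, not our exact pipeline:

```python
import hashlib
import json

def report_hash(evaluation: dict) -> str:
    # Canonical JSON (sorted keys, compact separators) so identical
    # evaluation data always serializes to identical bytes.
    canonical = json.dumps(evaluation, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```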
Immutable record
The hash is embedded in your report and stored independently. If anyone changes even a single character in the evaluation data, the hash won't match.
You can verify
Your report includes the raw data and the hash. Run it through SHA-256 yourself. If the hashes match, the report is exactly what our pipeline produced — unaltered.
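As a sketch, assuming the raw data ships as JSON and the embedded hash is a hex digest (the serialization step must match whatever the pipeline used at generation time):

```python
import hashlib
import json

def verify_report(raw_data: dict, embedded_hash: str) -> bool:
    # Recompute the digest over the same canonical serialization used at
    # generation time; changing even one character changes the digest.
    canonical = json.dumps(raw_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == embedded_hash
```

If verify_report returns True, the data in front of you is byte-for-byte what the pipeline produced.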