Case Studies
We publish results for optimizations that work and optimizations that don't. Every skill goes through the same blind evaluation. Here's what real results look like, including when we fail.
Full methodology and detailed reports: github.com/willynikes2/skill-evals
You only pay for results
If your skill doesn't improve by at least 10 percentage points or win 60%+ of blind evaluations, you get a full refund. No questions, no friction.
Cryptographically verified reports
Every report card is SHA-256 hashed at generation time. The hash covers all judge votes, scores, and test inputs. We can't alter results after the fact — and you can prove it.
Brainstorming
A clear win. 10 out of 10.
The brainstorming skill turns ideas into structured designs before code gets written. The optimized version won every single blind evaluation — across simple features, complex systems, refactors, and greenfield projects. It's also 70% smaller.
Why this optimization succeeded
- Binary pass/fail criteria prevented reward hacking. The optimizer couldn't game vague quality scores; each criterion either passed or it didn't (see the sketch after this list).
- 3 independent blind judges returned consistent verdicts on every test case: Claude and OpenAI picked the optimized version each time, and Gemini never once preferred the original. When three different models point the same way, the improvement is real, not an artifact of one model's preferences.
- The optimized version is 70% smaller (10.5KB → 3.1KB). The optimizer removed redundant steps and verbose instructions, strong evidence that concise prompts can outperform bloated ones.
- 10 diverse test cases covered the full range: simple features, complex systems, refactors, greenfield projects, API design, and security reviews. No cherry-picking.
- This is what proper fitness function design looks like. The same pipeline that caught writing-plans gaming the judges confirmed brainstorming as a genuine improvement.
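For illustration, here's a minimal sketch of what binary pass/fail scoring looks like. The three criteria are hypothetical stand-ins, not the actual brainstorming rubric:

```python
def score(output: str) -> int:
    # Every criterion is strictly pass/fail: no partial credit for an
    # optimizer to inflate. These three checks are illustrative only.
    criteria = [
        "## Requirements" in output,    # produced a requirements section
        "## Open Questions" in output,  # surfaced open questions
        len(output.split()) < 1200,     # stayed within a length budget
    ]
    return sum(criteria)  # count of criteria passed, 0-3
```

Because each check is a hard boolean, the only way to raise the score is to actually satisfy more criteria.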
Blind Evaluation Results
Three independent AI judges evaluate both versions without knowing which is optimized. Majority rules.
| Test Case | Claude | OpenAI | Gemini | Verdict |
|---|---|---|---|---|
| T01 Simple Feature | OPT | OPT | tie | OPTIMIZED |
| T02 Complex Feature | OPT | OPT | tie | OPTIMIZED |
| T03 Refactor | OPT | OPT | tie | OPTIMIZED |
| T04 Greenfield Project | OPT | OPT | tie | OPTIMIZED |
| T05 API Design | OPT | OPT | tie | OPTIMIZED |
| T06 UI Component | OPT | OPT | tie | OPTIMIZED |
| T07 Data Pipeline | OPT | OPT | tie | OPTIMIZED |
| T08 Integration | OPT | OPT | tie | OPTIMIZED |
| T09 Performance | OPT | OPT | tie | OPTIMIZED |
| T10 Security | OPT | OPT | tie | OPTIMIZED |
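To make "majority rules" concrete, here's a minimal sketch of the verdict computation. The vote encoding is hypothetical, not our actual evaluation code:

```python
from collections import Counter

def verdict(votes: list[str]) -> str:
    """Majority verdict over judge votes: 'OPT', 'ORIG', or 'tie'."""
    counts = Counter(votes)
    if counts["OPT"] > len(votes) // 2:
        return "OPTIMIZED"
    if counts["ORIG"] > len(votes) // 2:
        return "ORIGINAL"
    return "TIE"

# T01 above: Claude and OpenAI pick the optimized version, Gemini scores
# a tie. Two of three is a majority, so the verdict is OPTIMIZED.
print(verdict(["OPT", "OPT", "tie"]))  # -> OPTIMIZED
```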
Takeaways
1. Smaller isn't always worse: the optimized skill removed redundant steps and got better results in fewer tokens.
2. Claude and OpenAI sided with the optimized version on every test case, and Gemini never sided with the original. When judges converge like that, the signal is strong.
3. The original skill had a 10-step process. The optimized version does the same job in 7 steps.
4. This skill is from the Superpowers plugin by obra. We took the stock version, optimized it, and the results speak for themselves. That's the Presient model: find great skills, make them measurably better.
How report verification works
Hash at generation
When your report card is generated, we compute a SHA-256 hash of the complete evaluation data: every judge vote, every score, every test input and output.
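A minimal sketch of the idea, assuming the evaluation record is a JSON-serializable dict. The canonicalization shown is illustrative, not our exact pipeline:

```python
import hashlib
import json

def report_hash(evaluation: dict) -> str:
    # Canonical JSON (sorted keys, compact separators) so identical
    # evaluation data always serializes to identical bytes.
    canonical = json.dumps(evaluation, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```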
Immutable record
The hash is embedded in your report and stored independently. If anyone changes even a single character in the evaluation data, the hash won't match.
You can verify
Your report includes the raw data and the hash. Run it through SHA-256 yourself. If the hashes match, the report is exactly what our pipeline produced — unaltered.
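As a sketch, assuming the raw data ships as JSON and the embedded hash is a hex digest (the serialization step must match whatever the pipeline used at generation time):

```python
import hashlib
import json

def verify_report(raw_data: dict, embedded_hash: str) -> bool:
    # Recompute the digest over the same canonical serialization used at
    # generation time; changing even one character changes the digest.
    canonical = json.dumps(raw_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == embedded_hash
```

If verify_report returns True, the data in front of you is byte-for-byte what the pipeline produced.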