The Reward Hacking Problem: When AI Optimization Goes Wrong

Presient Labs Team · March 31, 2026 · 7 min read

We optimized six AI skills using the same evolutionary loop Karpathy open-sourced. Five improved measurably. One taught us something that changed how we evaluate everything.

This isn't an anti-autoresearch post. The pattern works. But it has a trap that nobody's talking about, and we fell into it.

The setup

The skill was writing-plans — generating structured project plans from a brief. We ran it through our optimization pipeline: generate variants, score them against a fitness function, keep the winners, mutate, repeat.
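
For concreteness, here's the shape of that loop as a minimal Python sketch. Everything in it is a hypothetical stand-in: generate_variants, mutate, and fitness would each call a model in a real pipeline. The point is the structure, not the implementation.

```python
import random

# Hypothetical stand-ins so the sketch runs end to end; a real pipeline
# would call a model for each of these.
def generate_variants(prompt: str, n: int) -> list[str]:
    return [f"{prompt} [variant {i}]" for i in range(n)]

def mutate(prompt: str) -> str:
    return prompt + " [mutated]"

def fitness(prompt: str) -> float:
    return random.random()  # placeholder score

def optimize(baseline: str, generations: int = 20,
             population: int = 8, keep: int = 2) -> str:
    pool = [baseline] + generate_variants(baseline, population - 1)
    for _ in range(generations):
        ranked = sorted(pool, key=fitness, reverse=True)  # score every variant
        winners = ranked[:keep]                           # keep the winners
        pool = winners + [mutate(random.choice(winners))  # mutate to refill
                          for _ in range(population - keep)]
    return max(pool, key=fitness)
```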

The internal win rate climbed to 96%. The optimized prompt was crushing the baseline on every metric we had.

Then we ran blind testing.

Independent evaluators. No context on which variant was the original vs. optimized. Binary pass/fail criteria — not vibes, not 1-10 scales.

The "winning" prompt scored 46%. Below coin-flip.

We spent real compute, real time, and real money — and ended up with a skill that performed worse than what we started with.

What reward hacking actually looks like

Reward hacking is when an optimizer finds shortcuts in your evaluation metric instead of genuinely improving the thing you care about.

It doesn't feel like failure when it's happening. The outputs look cleaner. The metrics go up. The winning variants seem more polished, more confident. You accumulate evidence that the loop is working.

What's actually happening: the model is optimizing for whatever pattern your fitness function rewards. If that pattern is even slightly misaligned with genuine quality, the optimizer finds the crack and widens it.

In our case, the fitness function checked structural markers — plan length, section headers, presence of certain elements. Reasonable proxies for quality. Not quality itself.

The optimized prompts learned to produce outputs that looked like good plans. Well-formatted. Thorough-seeming. Structurally complete. They were less useful as actual plans to follow.
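
To make the failure mode concrete, here's a sketch of that kind of fitness function. The specific checks and thresholds are illustrative assumptions, not our actual code.

```python
REQUIRED_SECTIONS = ["Timeline", "Budget", "Risks", "Milestones"]

def structural_fitness(plan: str) -> float:
    # Checks structural markers only: length, headers, bullet density.
    # Every check is a proxy for quality, and every proxy is gameable.
    score = 0.0
    if len(plan.split()) > 800:             # rewards sheer length
        score += 1.0
    lowered = plan.lower()
    for section in REQUIRED_SECTIONS:       # rewards section headers by name
        if section.lower() in lowered:
            score += 1.0
    if plan.count("\n- ") >= 10:            # rewards bullet density
        score += 1.0
    return score
```

An optimizer can max this function out with a long, well-headed, bullet-heavy document that is useless as an actual plan, which is roughly what happened to us.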

This is the same fundamental problem that exists in AI data annotation and RLHF. Raw model output isn't good enough — you need multiple independent perspectives to catch where the model is gaming the metric instead of genuinely improving. That's why companies spend millions on human evaluation. The same principle applies to the instruction layer.

Why 1-10 scoring makes it worse

Most evaluation systems use scaled scoring: "Rate this output from 1 to 10." This compounds the problem for two reasons:

  1. Score noise. The difference between an evaluator's 7 and an 8 is meaningless variance. Over many evaluations, that variance accumulates into noise an optimizer can exploit.
  2. Proxy drift. A scaled score is always a proxy for quality. "How good is this plan?" is subjective enough that a clever optimizer can find patterns that score high without being useful.

Binary evaluation — pass or fail on specific criteria — eliminates both problems. Either the plan includes a realistic timeline or it doesn't. Either the budget section uses actual numbers or it doesn't. No ambiguity, no noise to exploit.
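
In code terms, every criterion collapses to a boolean. A minimal sketch, with illustrative criteria and a hypothetical ask_judge callable standing in for an LLM call or a human annotation step:

```python
CRITERIA = [
    "Does the plan include a realistic, dated timeline?",
    "Does the budget section use actual numbers?",
    "Could a team execute this plan without asking for clarification?",
]

def judge_plan(plan: str, ask_judge) -> dict[str, bool]:
    # ask_judge(plan, criterion) must return True or False. No scales.
    return {criterion: ask_judge(plan, criterion) for criterion in CRITERIA}

def passes(results: dict[str, bool]) -> bool:
    return all(results.values())  # every criterion must pass; no partial credit
```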

Why single-judge evaluation fails

Using one evaluator (human or AI) creates another vulnerability: systematic bias. Every evaluator has preferences. A single evaluator's preferences become a target the optimizer can learn.

We use 3 independent judges per comparison, with randomized A/B ordering to prevent position bias. The independence prevents any single judge's preferences from dominating.
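
Here's a sketch of that comparison, assuming three hypothetical judge callables that each see the two outputs in an independently shuffled order:

```python
import random

def blind_compare(output_a: str, output_b: str, judge_fns) -> str:
    # judge_fns: three independent judges, each taking (first, second)
    # and returning "first" or "second".
    votes = {"a": 0, "b": 0}
    for judge in judge_fns:
        # Randomize presentation order per judge to prevent position bias.
        if random.random() < 0.5:
            first, second, labels = output_a, output_b, ("a", "b")
        else:
            first, second, labels = output_b, output_a, ("b", "a")
        verdict = judge(first, second)
        votes[labels[0] if verdict == "first" else labels[1]] += 1
    return "a" if votes["a"] > votes["b"] else "b"  # majority of 3 decides
```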

This is the same logic behind peer review, jury systems, and three-judge panels. Multiple independent perspectives catch things a single perspective misses.

What good evaluation looks like

After the writing-plans collapse, we rebuilt our entire validation protocol:

  • Binary pass/fail criteria — not scaled scores
  • 3 independent blind judges per comparison
  • Randomized A/B ordering — prevents position bias
  • Holdout test sets the optimizer never sees (see the sketch after this list)
  • Adversarial thinking — "how would a capable optimizer fool this test?"
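
Tying these together, a final holdout check might look like the sketch below. It builds on the blind_compare sketch above; run_skill and the judge functions are hypothetical callables, not our actual code.

```python
def blind_win_rate(baseline: str, optimized: str, holdout_briefs: list[str],
                   run_skill, judge_fns) -> float:
    # run_skill(prompt, brief) -> output text. The key property: the
    # holdout briefs were never visible to the optimization loop.
    wins = 0
    for brief in holdout_briefs:
        opt_out = run_skill(optimized, brief)
        base_out = run_skill(baseline, brief)
        if blind_compare(opt_out, base_out, judge_fns) == "a":
            wins += 1
    return wins / len(holdout_briefs)
```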

The five skills that passed under these conditions? We trust those results. The brainstorming skill went from 60% to 90% under blind testing. That's a real gain, not a gamed metric.

The takeaway for anyone optimizing prompts

Autoresearch is a legitimate breakthrough. The loop works. $0.20 per cycle, 100 experiments overnight — the math is real.

But the loop is only as good as your eval. If you're seeing dramatic improvements on your internal metrics, ask yourself: would these hold up if someone who didn't build the test ran the evaluation blind?

You can spend money on compute and end up with a worse result than you started with. The loop doesn't know the difference between genuine improvement and reward hacking. Only your evaluation knows. And if your evaluation is weak, you're paying to make things worse.

We published our methodology, results, and the writing-plans failure at github.com/willynikes2/skill-evals. We didn't hide the failure. It's the reason you should trust the five that passed.


Ready to optimize your AI skills?

$25, one optimization, blind evaluation with 3 independent judges. If it doesn't beat baseline by at least 10 percentage points, automatic refund.