We have too many benchmarks, and most of them measure something other than what you care about. This article proposes a framework for choosing evaluations that actually correlate with the downstream task, and it is honest about the dozen popular ones that do not.
The four evals that matter (and the dozen that don't)
We have too many benchmarks and too few signals. A framework for choosing evaluations that correlate with the thing you actually care about.
Cite as: Chen, Liam. "The four evals that matter (and the dozen that don't)." mlsystems.dev, Jul 22, 2026.