We have too many benchmarks, and most of them do not measure what you care about. This article proposes a framework for choosing evaluations that actually correlate with the downstream task, and it is honest about the dozen popular ones that do not.