Hypothesis

It is impossible to score much higher than about 93% on MMLU without cheating, because around 6.5% of its questions contain errors.

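About 6.5% of MMLU questions are themselves wrong, which means the "true ceiling" for any model is 93.5%, not 100%. More surprising still: the error rate of this authoritative benchmark, widely used for years, was only recently studied and quantified systematically. This exposes a deeper problem in the AI evaluation ecosystem: benchmark quality itself needs to be benchmarked, and this layer of meta-evaluation has almost never been taken seriously.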
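The ceiling arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `score_ceiling` and the `chance_match` parameter are hypothetical and not part of the original note. `chance_match` models the possibility that a model which answers every valid question correctly still happens to match the flawed answer key on some erroneous questions, which may be why the hypothesis says "much higher than 93%" rather than giving a hard limit.

```python
def score_ceiling(error_rate: float, chance_match: float = 0.0) -> float:
    """Expected best score for a model that gets every valid question right.

    error_rate:   fraction of questions with a flawed answer key
                  (0.065 is the figure quoted in the note).
    chance_match: ASSUMPTION, not from the note: probability that the model
                  still matches the flawed key on an erroneous question.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if not 0.0 <= chance_match <= 1.0:
        raise ValueError("chance_match must be between 0 and 1")
    # Valid questions are all answered correctly; erroneous ones only
    # score when the model happens to agree with the flawed key.
    return (1.0 - error_rate) + error_rate * chance_match


# Strict reading of the note: erroneous questions can never be scored.
strict = score_ceiling(0.065)

# Looser reading, assuming a 1-in-4 chance of matching the flawed key
# on a four-option question (an illustrative value, not a measurement).
lenient = score_ceiling(0.065, chance_match=0.25)

print(f"strict ceiling:  {strict:.1%}")
print(f"lenient ceiling: {lenient:.1%}")
```

Under the strict reading the ceiling is 93.5%, matching the note. The lenient value depends entirely on the assumed `chance_match` and should be read as a sensitivity check, not as a claim about MMLU.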

MMLU-errors 6.5-percent benchmark-quality meta-evaluation

Tags

Annotators

URL