10 Matching Annotations
  1. Last 7 days
    1. The next generation of benchmarks needs to be harder, more realistic, and less gameable

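      [Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward "real work" rather than converge on "exam questions". But this leads to a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more expensive it becomes (METR: 8,000 USD per task), and the slower it is to release. AI evaluation now faces a fundamental trade-off between evaluation speed and evaluation quality.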

  2. Dec 2025
  3. Nov 2024
  4. Feb 2024
  5. Dec 2023
    1. I think that we should be putting a high priority on developing the next generation of nuclear power, but it's going to be a tough job, and as long as the special interests are controlling our government, we're not going to solve it
  6. Aug 2022
  7. Jul 2020
  8. Jun 2020
  9. Mar 2020