The next generation of benchmarks needs to be harder, more realistic, and less gameable
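[Insight] "Harder, more realistic, less gameable": these three criteria essentially ask benchmarks to move toward "real work" rather than converge on "exam questions." But this leads to a paradox: the more realistic a benchmark is, the harder it is to score automatically, the more expensive it becomes (METR: USD 8,000 per task), and the slower it is to release. AI evaluation is facing a fundamental trade-off between evaluation speed and evaluation quality.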
those have to deliver. They can kind of get away with not delivering on their promises intergenerationally
for - history - progress - secular golden age promise - can procrastinate by pushing it forward to the next generation
Weismann barrier
for - Weismann Barrier - disputed by Darwin himself, who believed that the body sends information to the germ line to pass on to the next generation - Denis Noble
for - urban planning - tools - sustainable cities - Next Generation Cities Institute - TPF - LCE
I think that we should be putting a high priority on developing the next generation nuclear power, but it's going to be a tough job, and as long as the special interests are controlling our government, we're not going to solve it
for: climate crisis - next generation nuclear - alternative to, question - James Hansen - knowledge of deep geothermal power
climate crisis: next generation nuclear - alternative
Regev-Yochay, G., Gonen, T., Gilboa, M., Mandelboim, M., Indenbaum, V., Amit, S., Meltzer, L., Asraf, K., Cohen, C., Fluss, R., Biber, A., Nemet, I., Kliker, L., Joseph, G., Doolman, R., Mendelson, E., Freedman, L. S., Harats, D., Kreiss, Y., & Lustig, Y. (2022). 4th Dose COVID mRNA Vaccines’ Immunogenicity & Efficacy Against Omicron VOC (p. 2022.02.15.22270948). medRxiv. https://doi.org/10.1101/2022.02.15.22270948
How Europe can emerge stronger out of the coronavirus crisis. (n.d.). World Economic Forum. Retrieved 25 July 2020, from https://www.weforum.org/agenda/2020/07/resilient-european-economy/
Hiatt, J., Patwardhan, R., Turner, E. et al. Parallel, tag-directed assembly of locally derived short sequence reads. Nat Methods 7, 119–122 (2010). https://doi.org/10.1038/nmeth.1416
Darmok
Captain Dathon