Hypothesis

2 Matching Annotations

Apr 2026
openai.com

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

1
1. fxp007 27 Apr 2026
  
  in Public
  
  We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
  
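  Most people assume that gains in AI model performance come mainly from improvements in algorithms and architecture. The authors find, however, that a model's success on SWE-bench depends more on whether it saw these problems during training than on any real improvement in programming ability. This runs counter to the industry's prevailing "model progress" narrative and suggests that current evaluations of AI progress may be seriously biased.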
  
  counterintuitive model-progress evaluation-bias

Tags

evaluation-bias

model-progress

counterintuitive

Annotators

fxp007

URL

openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
a16z.com

Where Enterprises are Actually Adopting AI - a16z

1
1. fxp007 16 Apr 2026
  
  in Public
  
  The most notable finding here is that the model capabilities are improving _fast._ There are several domains that have shown dramatic improvements in the last 4 months — with accounting and auditing showing nearly a 20 percent jump on GDPval and even domains like police / detective work showing a nearly 30 percent improvement.
  
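  Surprisingly, AI model capabilities have improved dramatically over the past 4 months: accounting and auditing gained nearly 20 percent on the GDPval benchmark, and police / detective work gained nearly 30 percent. This pace of progress far exceeds expectations and suggests that AI will reach breakthrough applications in more domains.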
  
  surprising ai-progress model-improvement fun-fact

Tags

ai-progress

surprising

fun-fact

model-improvement

Annotators

fxp007

URL

a16z.com/where-enterprises-are-actually-adopting-ai/
