Hypothesis

3 Matching Annotations

May 2026
arxiv.org arxiv.org

Untitled document

1
1. fxp007 19 May 2026
  
  in Public
  
  pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.
  
  This argument shifts the locus of the problem from the model's architecture to the socio-technical systems that surround it. It's a provocative claim that the core issue isn't 'how to build a better model' but 'how to build a better system for deploying and governing models,' placing the onus on developers and regulators, not just AI researchers.
  
  Deployment Governance Systemic Failure
Visit annotations in context

Tags

Deployment Governance

Systemic Failure

Annotators

fxp007

URL

arxiv.org/abs/2605.14912
Apr 2026
rdi.berkeley.edu rdi.berkeley.edu

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.
  
  令人惊讶的是：研究人员构建的自动化扫描工具发现，所有八个主流AI代理基准测试都存在漏洞，无需解决任何任务就能获得接近完美的分数。这表明整个AI评估领域存在系统性问题，几乎所有当前使用的基准测试都不可靠。
  
  surprising ai-evaluation systemic-failure
Visit annotations in context

Tags

ai-evaluation

surprising

systemic-failure

Annotators

fxp007

URL

rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
Oct 2020
leiss.ca leiss.ca

Complexity cloaks Catastrophe

1
1. TylerRick 14 Oct 2020
  
  in Public
  
  complexity in financial innovations is itself an important risk factor for systemic failure in the financial sector
  
  risk factor systemic failure systemic problem
Visit annotations in context

Tags

systemic failure

risk factor

systemic problem

Annotators

TylerRick

URL

leiss.ca/

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL