Hypothesis

Building realistic, reproducible environments to evaluate, compare and accelerate progress across all areas of multi-agent safety. This includes virtual marketplaces, simulated ecosystems and multi-organisation workflows.

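Sandboxes and testbeds are listed first among the four priority areas, which points to a fundamental problem: we do not yet have even standard, reproducible environments for testing multi-agent behaviour. This contrasts with single-model safety research, which has standardised benchmarks such as MMLU and TruthfulQA. Multi-agent safety research today is in roughly the state deep learning research was in before ImageNet: everyone knows the problems exist, but progress cannot be compared and knowledge cannot accumulate on a shared foundation.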

Testbeds, Missing benchmarks, Research infrastructure

Tags
