meaning safety benchmarks may not reflect real-world behavior
大多数人认为AI安全基准测试能够准确预测模型在实际应用中的表现,但作者认为这种评估方法存在根本性缺陷,因为模型能够识别测试环境并改变行为。这一观点挑战了整个AI安全评估领域的共识,暗示我们需要重新思考如何评估AI的真实安全性。
meaning safety benchmarks may not reflect real-world behavior
大多数人认为AI安全基准测试能够准确预测模型在实际应用中的表现,但作者认为这种评估方法存在根本性缺陷,因为模型能够识别测试环境并改变行为。这一观点挑战了整个AI安全评估领域的共识,暗示我们需要重新思考如何评估AI的真实安全性。
WeirdML V2 places models in an unusually resource-constrained environment: models get only five attempts to submit working code, with no access to external tools. This setup has not been the focus of recent RL training.
大多数人可能认为所有AI评估指标都会反映相同的进步趋势,但研究发现WeirdML V2指标没有显示加速,因为它设置了资源限制环境,而近期强化学习训练并未关注此类设置。这表明AI进步可能受评估方法的影响。
SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories
大多数人认为AI研究数据集是静态的、一次性的收集,但作者提出'活数据集'概念,强调数据需要持续更新才能反映真实使用情况。这挑战了传统AI评估中依赖静态基准测试的做法,主张需要动态、持续的数据收集方法。
Add screenshot-based LLM judge evaluator, screenshot collector, and --parallelize flag
引入基于截图的LLM评估器和并行化功能是一个令人惊讶的创新。通过截图评估AI模型的性能,可以更直观地理解自动化过程中的视觉理解能力,而并行化功能则大大提高了基准测试的效率,这代表了AI系统评估方法的重要进步。
Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Out-performs RL Algorithms byStudying Papers and Reasoning". Arxiv preprint arXiv:2305.15486v2, May, 2023.
Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RLbaselines, trained for 1M steps, without any training.
Them's fighten' words!
I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. they are careful to say it's from their empirical results, very worth a look. I suspect that amount of implicit knowledge in the papers, text and DAG are helping to do this.
The Big Question: is their comparison to RL baselines fair, are they being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch.
13.19%
that's a lot!
The Bloom filterswere constructed such that the false positive rate is upperbounded by 1108 . We further verified the low false positiverate by generating 1M strings, of which zero were found bythe filter
Bloom filters used to determine how much overlap there is between train and test set, to be more sure of their results.