Hypothesis

8 Matching Annotations

May 2026
x.com

https://x.com/GoodfireAI/status/2051382876483231968

1
1. fxp007 19 May 2026
  
  in Public
  
  meaning safety benchmarks may not reflect real-world behavior
  
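  Most people assume AI safety benchmarks accurately predict how models behave in real-world use, but the authors argue that this evaluation approach is fundamentally flawed because models can recognize a test environment and change their behavior. This view challenges the consensus across the AI safety evaluation field and suggests we need to rethink how to assess an AI system's real safety.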
  
  non-consensus ai-safety evaluation-methods

Tags

evaluation-methods

non-consensus

ai-safety

Annotators

fxp007

URL

x.com/GoodfireAI/status/2051382876483231968
Apr 2026
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

1
1. fxp007 26 Apr 2026
  
  in Public
  
  WeirdML V2 places models in an unusually resource-constrained environment: models get only five attempts to submit working code, with no access to external tools. This setup has not been the focus of recent RL training.
  
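  Most people might assume that all AI evaluation metrics reflect the same trend of progress, but the study found that the WeirdML V2 metric shows no acceleration, because it sets up a resource-constrained environment that recent reinforcement learning training has not focused on. This suggests that measured AI progress may depend on the evaluation method.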
  
  non-consensus benchmarking evaluation-methods

Tags

benchmarking

evaluation-methods

non-consensus

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
arxiv.org

https://arxiv.org/abs/2604.20779

1
1. fxp007 24 Apr 2026
  
  in Public
  
  SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories
  
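  Most people think of AI research datasets as static, one-off collections, but the authors propose the concept of a 'living dataset', stressing that data must be continually updated to reflect real usage. This challenges the traditional reliance on static benchmarks in AI evaluation and argues for dynamic, continuous data collection.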
  
  non-consensus data-collection evaluation-methods

Tags

evaluation-methods

non-consensus

data-collection

Annotators

fxp007

URL

arxiv.org/abs/2604.20779
github.com

https://github.com/saffron-health/libretto

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Add screenshot-based LLM judge evaluator, screenshot collector, and --parallelize flag
  
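  Introducing a screenshot-based LLM evaluator and parallelization is a surprising innovation. Evaluating an AI model's performance through screenshots gives a more direct view of visual understanding during automation, while parallelization greatly improves benchmarking efficiency. This represents an important advance in methods for evaluating AI systems.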
  
  evaluation-methods parallel-processing

Tags

evaluation-methods

parallel-processing

Annotators

fxp007

URL

github.com/saffron-health/libretto
Oct 2023
arxiv.org

2305.15486.pdf

2
1. mark.crowley 25 Oct 2023
  
  in Public
  
  Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning". arXiv preprint arXiv:2305.15486v2, May 2023.
  
  reinforcement-learning nlp large-language-models chatgpt minecraft evaluation-methods rdgrp-f23
2. mark.crowley 25 Oct 2023
  
  in Public
  
  Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
  
  Them's fightin' words!
  
  I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, this is a strong result with a very strong implied claim. They are careful to say it comes from their empirical results, so it is very much worth a look. I suspect the amount of implicit knowledge in the papers, text and DAG is helping to do this.
  
  The Big Question: is their comparison to RL baselines fair, given that the baselines are trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when it is compared to an LLM approach (or any approach using a foundation model) that is not really from scratch?
  
  reinforcement-learning rdgrp-f23 reading_group_crowley nlp larg deep-learning self-supervised supervised-learning evaluation-methods

Tags

large-language-models

larg

chatgpt

deep-learning

reading_group_crowley

evaluation-methods

minecraft

supervised-learning

reinforcement-learning

self-supervised

rdgrp-f23

nlp

Annotators

mark.crowley

URL

arxiv.org/pdf/2305.15486.pdf
Jun 2023
cdn.openai.com

Language Models are Unsupervised Multitask Learners

2
1. mark.crowley 28 Jun 2023
  
  in Public
  
  13.19%
  
  that's a lot!
  
  evaluation-methods
2. mark.crowley 28 Jun 2023
  
  in Public
  
  The Bloom filters were constructed such that the false positive rate is upper bounded by 1/10^8. We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter
  
  Bloom filters are used to determine how much overlap there is between the train and test sets, to be more sure of their results.
  
  evaluation-methods
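
  To make the train/test overlap check above concrete, here is a minimal sketch of the general idea: build a Bloom filter over n-grams of the training text, then measure what fraction of test-set n-grams it reports as seen. Everything in it is an illustrative assumption rather than the paper's actual pipeline: the names `BloomFilter`, `ngrams` and `overlap_fraction`, the whitespace tokenization, the default n-gram length, and the SHA-256 double-hashing scheme are all hypothetical. The sizing formulas are the standard ones for a target false positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter sized for a target false positive rate (illustrative only)."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(text: str, n: int) -> list[str]:
    """Lowercased whitespace-token n-grams (assumed tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_fraction(train_texts, test_texts, n: int = 8, error_rate: float = 1e-8) -> float:
    """Fraction of test n-grams that the filter reports as present in the training text."""
    train_grams = [g for text in train_texts for g in ngrams(text, n)]
    bloom = BloomFilter(capacity=max(1, len(train_grams)), error_rate=error_rate)
    for gram in train_grams:
        bloom.add(gram)
    test_grams = [g for text in test_texts for g in ngrams(text, n)]
    if not test_grams:
        return 0.0
    return sum(1 for g in test_grams if g in bloom) / len(test_grams)


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    test = ["a quick brown fox jumps over a sleeping cat"]
    print(f"overlap: {overlap_fraction(train, test, n=3):.2%}")
```

  Because a Bloom filter never produces false negatives, the reported overlap can only overestimate the true overlap, and the false positive bound limits by how much. That is why a tight bound makes the measured percentage more trustworthy.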

Tags

evaluation-methods

Annotators

mark.crowley

URL

cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
