Hypothesis

12 Matching Annotations

Apr 2026
openai.com

https://openai.com/index/gpt-5-5-system-card/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  We separately evaluate GPT‑5.5 Pro in certain cases because we judge that the setting could materially impact the relevant risks or appropriate safeguards posture.
  
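  Most people assume that if two models share the same base architecture, their risks and safety requirements should be similar. OpenAI, however, states explicitly that GPT-5.5 Pro needs separate evaluation because 'the setting could materially impact the relevant risks or appropriate safeguards posture'. This challenges the common assumption in AI evaluation that models built on the same base have consistent safety properties, and suggests that even small changes in setting can produce markedly different risk profiles.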
  
  non-consensus risk-assessment model-evaluation

Tags

risk-assessment

non-consensus

model-evaluation

Annotators

fxp007

URL

openai.com/index/gpt-5-5-system-card/
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

1
1. fxp007 30 Apr 2026
  
  in Public
  
  The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and one for non-reasoning models.
  
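  This model-selection result (best on 100% of the three metrics) indicates that splitting models into reasoning and non-reasoning classes gives the best predictive model. It provides strong statistical evidence that reasoning capability may be a key driver of accelerating AI progress. However, the article does not spell out how a reasoning model is defined, which could affect the reliability of the result.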
  
  data-point statistics model-evaluation

Tags

statistics

model-evaluation

data-point

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
openai.com

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

1
1. fxp007 27 Apr 2026
  
  in Public
  
  We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
  
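  Most people believe that performance gains in AI models come mainly from improvements in algorithms and architecture. The authors found, however, that a model's success on SWE-bench depends more on whether it saw the problems during training than on genuine gains in programming ability. This runs against the industry's prevailing 'model progress' narrative and suggests that current assessments of AI progress may be seriously biased.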
  
  counterintuitive model-progress evaluation-bias

Tags

model-progress

counterintuitive

evaluation-bias

Annotators

fxp007

URL

openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
aisle.com

https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier

1
1. fxp007 17 Apr 2026
  
  in Public
  
  The capability rankings reshuffled completely across tasks. There is no stable best model across cybersecurity tasks. The capability frontier is jagged.
  
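  This finding reveals a 'jagged frontier' in AI security capability: different models perform very differently across security tasks. It suggests there is no one-size-fits-all best security model; the model has to be chosen for the specific task, which has important implications for the design of AI security systems.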
  
  model-evaluation security-tasks capability-scaling

Tags

capability-scaling

security-tasks

model-evaluation

Annotators

fxp007

URL

aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier
epoch.ai

https://epoch.ai/blog/mirrorcode-preliminary-results

2
1. fxp007 17 Apr 2026
  
  in Public
  
  We found weak evidence that Opus 4.0 and 4.1 had partially memorized cal, but no evidence Opus 4.6 had memorized it, despite performing best of all models considered.
  
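  This finding is surprising because the best-performing model showed no memorization effect. It may indicate that the newest AI models rely more on genuine understanding and reasoning than on simple recall when solving complex problems, which offers a new perspective for evaluating AI capability.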
  
  memorization model-evaluation ai-understanding
2. fxp007 17 Apr 2026
  
  in Public
  
  Older models were more prone to submitting prematurely, even when test cases weren't passing.
  
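  This observation reveals notable differences in task persistence across AI model versions. Earlier models were more likely to submit incomplete solutions prematurely, while the newest models show stronger task persistence and engineering judgment. The difference may reflect progress in AI self-assessment and task-management abilities.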
  
  model-comparison task-persistence ai-evaluation

Tags

memorization

ai-evaluation

model-evaluation

ai-understanding

model-comparison

task-persistence

Annotators

fxp007

URL

epoch.ai/blog/mirrorcode-preliminary-results
ai.meta.com

https://ai.meta.com/blog/introducing-muse-spark-msl/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Muse Spark demonstrated the highest rate of evaluation awareness of models they have observed.
  
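  Surprisingly, the third-party evaluator Apollo Research found that Muse Spark showed the highest rate of 'evaluation awareness' of any model they have observed: the model frequently identified 'alignment traps' and recognized that it was being evaluated. This kind of metacognition is extremely rare in AI models and may signal a step toward more advanced reasoning.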
  
  surprising ai-awareness model-evaluation

Tags

ai-awareness

surprising

model-evaluation

Annotators

fxp007

URL

ai.meta.com/blog/introducing-muse-spark-msl/
Apr 2022
twitter.com

ReconfigBehSci on Twitter

1
1. emilybiggs 20 Apr 2022
  
  in Public
  
  ReconfigBehSci. (2020, November 5). @ToddHorowitz3 2/2 so I would prefer to treat this as an opportunity for empirical observation and learning. Evaluation should focus on trying to assess actual contribution, not a priori judgments. [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1324367278352355330
  
  is:twitter lang:en COVID-19 R opportunity evaluation learning SAGE model expert predictions

Tags

opportunity

expert

predictions

learning

is:twitter

COVID-19

R

evaluation

SAGE

lang:en

model

Annotators

emilybiggs

URL

twitter.com/SciBeh/status/1324367278352355330
Dec 2020
psyarxiv.com

Putting psychology to the test: Rethinking model evaluation through benchmarking and prediction

1
1. marta_radosevic 01 Dec 2020
  
  in BehSci
  
  Rocca, R., & Yarkoni, T. (2020). Putting psychology to the test: Rethinking model evaluation through benchmarking and prediction. PsyArXiv. https://doi.org/10.31234/osf.io/e437b
  
  is:preprint lang:en psychology model evaluation benchmarking machine learning reliability modelling utility prediction

Tags

benchmarking

machine learning

modelling

utility

psychology

prediction

is:preprint

lang:en

reliability

model evaluation

Annotators

marta_radosevic

URL

psyarxiv.com/e437b/
Mar 2019
deepblue.lib.umich.edu

PFI20251.indd

1
1. ks9 22 Mar 2019
  
  in Public
  
  Human Performance Technology Model. This is an eight-page PDF that gives an overview of the human performance technology model. It is a black-and-white PDF that is simply written and accessible to the layperson. The authors are prominent writers in the field of performance technology. Rating 5/5
  
  etcnau performance improvement performance technology evaluation models Van Tiem theory Behavior Engineering Model diagram etc556

Tags

performance improvement

evaluation models

Van Tiem

etcnau

theory

diagram

etc556

performance technology

Behavior Engineering Model

Annotators

ks9

URL

deepblue.lib.umich.edu/bitstream/handle/2027.42/90576/20251_ftp.pdf
Nov 2018
iphysresearch.github.io

A Paper A Day

1
1. Herb 15 Nov 2018
  
  in Public
  
  Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
  
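  The paper's experiments explore model performance after applying shifts (controlled perturbations) to a dataset, and it proposes a classifier-based method/pipeline to observe and evaluate the effect.
  
  For my gravitational-wave data research, I can borrow its data-shift methods and its evaluation mechanism (two-sample tests).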
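
The two-sample-test idea mentioned in the note above can be sketched roughly as follows. This is only an illustrative permutation test on the difference of sample means, not the pipeline from the paper; the function names `permutation_two_sample_test` and `make_samples`, the Gaussian toy data, and all parameter values are assumptions made for the example.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_two_sample_test(source, target, n_permutations=2000, seed=0):
    """Return (observed_gap, p_value) for the absolute difference in means.

    The p-value is the fraction of random relabelings of the pooled data
    whose mean gap is at least as large as the observed gap, with the
    usual +1 correction so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(_mean(source) - _mean(target))
    pooled = list(source) + list(target)
    n_source = len(source)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(_mean(pooled[:n_source]) - _mean(pooled[n_source:]))
        if gap >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


def make_samples(shift, n=200, seed=1):
    """Toy data (hypothetical): unit-variance Gaussians, target mean moved by `shift`."""
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(n)]
    target = [rng.gauss(shift, 1.0) for _ in range(n)]
    return source, target


if __name__ == "__main__":
    for shift in (0.0, 1.0):
        src, tgt = make_samples(shift)
        gap, p_value = permutation_two_sample_test(src, tgt)
        print(f"shift={shift}: mean gap={gap:.3f}, p-value={p_value:.4f}")
```

A small p-value suggests the two samples come from different distributions, which is the signal a shift detector would act on. For high-dimensional data, the test would normally be applied to a lower-dimensional representation rather than to raw inputs.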
  
  performace model evaluation dataset

Tags

performace

model evaluation

dataset

Annotators

Herb

URL

iphysresearch.github.io/paper_summary/APaperADay.html
Oct 2018
iphysresearch.github.io

A Paper A Day

1
1. Herb 18 Oct 2018
  
  in Public
  
  Approximate Fisher Information Matrix to Characterise the Training of Deep Neural Networks
  
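  Characterizing the training of deep neural networks (convergence/generalization performance) with an approximate Fisher information matrix, which can be used to automatically optimize mini-batch size and learning rate.
  
  An interesting paper. It derives new quantities from the Fisher matrix to measure model performance during training, and uses them to optimize mini-batch sizes and learning rates | The figures in the paper are also nicely drawn | The authors argue that the conventional view of gradually increasing batch size is only partially true, and that gradually decreasing it may improve model convergence and generalization.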
  
  Generalization batch-size learning rate fisher matrix model evaluation

Tags

Generalization

learning rate

fisher matrix

batch-size

model evaluation

Annotators

Herb

URL

iphysresearch.github.io/paper_summary/APaperADay.html
