Hypothesis

3 Matching Annotations

Apr 2026
www.microsoft.com

https://www.microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/

2
1. fxp007 17 Apr 2026
  
  in Public
  
  The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.
  
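  This finding reveals a surprising phenomenon in AI evaluation: large swings in model performance may stem mainly from differences in task difficulty rather than from changes in the model's own capability. This challenges a simplistic understanding of AI 'capability' and suggests that AI systems may exhibit a pronounced 'threshold effect' for specific abilities, with performance dropping sharply once a certain difficulty level is reached.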
  
  threshold-effect task-difficulty
2. fxp007 16 Apr 2026
  
  in Public
  
  The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.
  
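  Surprisingly, the same AI model may score above 90% on low-demand tests yet below 15% on high-demand tests, which reflects differences in task demands rather than a change in the model's capability. This finding challenges the common perception that AI capability is stable and reveals how strongly task difficulty affects AI performance.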
  
  surprising ai-capabilities task-difficulty

Tags

ai-capabilities

threshold-effect

task-difficulty

surprising

Annotators

fxp007

URL

microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
Aug 2020
arxiv.org

The Wisdom and Persuadability of Threads

1
1. marta_radosevic 14 Aug 2020
  
  in BehSci
  
  Engelhardt, R., Hendricks, V. F., & Stærk-Østergaard, J. (2020). The Wisdom and Persuadability of Threads. ArXiv:2008.05203 [Physics]. http://arxiv.org/abs/2008.05203
  
  is:preprint lang:en decision-making judgement thread persuadability social information crowd accuracy model task difficulty communication

Tags

social information

persuadability

lang:en

judgement

task difficulty

model

crowd accuracy

is:preprint

communication

decision-making

thread

Annotators

marta_radosevic

URL

arxiv.org/abs/2008.05203