10 Matching Annotations
  1. Last 7 days
    1. Opus 4.7 introduces a new `xhigh` ('extra high') effort level between `high` and `max`, giving users finer control over the tradeoff between reasoning and latency on hard problems.

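      The new `xhigh` effort level shows that AI models can offer finer control over the tradeoff between reasoning depth and response speed. It reflects growing user demand for performance tuning and suggests that AI systems are becoming more customizable and specialized.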

    1. I-DLM-8B is the first DLM to match the quality of its same-scale AR counterpart, outperforming LLaDA-2.1-mini (16B) by +26 on AIME-24 and +15 on LiveCodeBench-v6 with half the parameters

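      Surprising: with only 8 billion parameters, I-DLM-8B beats the 16-billion-parameter LLaDA-2.1-mini by 26 points on AIME-24 and 15 points on LiveCodeBench-v6. This suggests that a diffusion model has, for the first time, reached quality on par with an autoregressive model while using half the parameters, which upends the common belief that diffusion models lag autoregressive models in quality.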

  2. Apr 2026
    1. Cost (USD) to run the evaluation: GPT-5.4 (xhigh): $1,110, Claude Opus 4.6 (max): $1,055

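      A single run of the 452-task evaluation costs $1,110 for GPT-5.4 and $1,055 for Claude Opus 4.6, or roughly $2.30 to $2.50 per task. Gemini 3 Flash needs only $596 and scores 27.7% (vs. 33.3% for the top models). This cost-performance figure is critical for model selection: if the business case can accept a success rate of about 27% instead of 33%, Gemini 3 Flash saves nearly half the cost. In large-scale financial-services deployments, that difference is multiplied thousands of times over.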
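The arithmetic behind this note can be checked with a minimal sketch. The dollar totals, the 452-task count, and the 27.7% score come from the annotation and note above; the function names and the "cost per solved task" metric are illustrative assumptions, not something the source defines.

```python
TASKS = 452  # number of tasks in the evaluation, per the note

# Total cost in USD for one evaluation run, as quoted above.
cost_usd = {
    "GPT-5.4 (xhigh)": 1110,
    "Claude Opus 4.6 (max)": 1055,
    "Gemini 3 Flash": 596,
}


def cost_per_task(total_usd, tasks=TASKS):
    """Average spend per attempted task."""
    return total_usd / tasks


def savings_fraction(cheap_usd, expensive_usd):
    """Fraction of the expensive run's cost saved by the cheaper run."""
    return 1 - cheap_usd / expensive_usd


def cost_per_success(total_usd, success_rate, tasks=TASKS):
    """Hypothetical metric: spend per task actually solved."""
    return total_usd / (tasks * success_rate)


if __name__ == "__main__":
    for name, usd in cost_usd.items():
        print(f"{name}: ${cost_per_task(usd):.2f} per task")

    flash = cost_usd["Gemini 3 Flash"]
    for name in ("GPT-5.4 (xhigh)", "Claude Opus 4.6 (max)"):
        saved = savings_fraction(flash, cost_usd[name])
        print(f"Gemini 3 Flash vs {name}: {saved:.1%} cheaper")

    # 27.7% is the score the note attributes to Gemini 3 Flash.
    print(f"Gemini 3 Flash: ${cost_per_success(flash, 0.277):.2f} per solved task")
```

The note does not say which model the 33.3% top score belongs to, so the sketch applies `cost_per_success` only to Gemini 3 Flash.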

    1. our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp)

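      Most people assume that on complex multi-turn tasks only large language models can make significant gains through reinforcement learning. The authors show that even a smaller 4B model improves substantially with their method, while the 30B model gains a striking 11.5 percentage points, challenging the common belief that "bigger is better."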

    2. the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller

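      Most people assume that model size correlates directly with performance, so a larger model must perform better. The authors show that a model with only 4 billion parameters, after reinforcement-learning training, outperforms GPT-4.1 and GPT-4o, which are 50 times larger, challenging the prevailing view in AI that "parameter count decides everything."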

  3. Nov 2022
    1. The most intriguing result in the present study is the positive effect of white noise on performance for the ADHD children. This noise effect was present in both the non-medicated and medicated children. This supports the MBA (Moderate Brain Arousal) model (Sikström & Söderlund, 2007), suggesting that the endogenous (neural) noise level in children with ADHD is sub-optimal. MBA accounts for the noise-enhancing phenomenon by stochastic resonance (SR). The model suggests that noise in the environment introduces internal noise into the neural system through the perceptual system. Of particular importance, the MBA model suggests that the peak of the SR curve depends on the dopamine level, so that participants with low dopamine levels (ADHD) require more noise for optimal cognitive performance compared to controls.

      Author's self-described "most intriguing result"

  4. Apr 2022
    1. ReconfigBehSci on Twitter: “@ToddHorowitz3 probably- and I think there are many interesting questions around why he is there and whether he should be there. But to answer those properly, looking at the performance of the model seems important and interesting to me- that is all I am saying” / Twitter. (n.d.). Retrieved March 6, 2021, from https://twitter.com/SciBeh/status/1324389147050569734

  5. Mar 2019