Hypothesis

4 Matching Annotations

Jun 2026
huggingface.co

https://huggingface.co/blog/zai-org/glm-52-blog

1
1. fxp007 17 Jun 2026
  
  in Public
  
  We find that GLM-5.2 shows more potential hacking behavior than GLM-5.1. This makes the verification signal easy to optimize, but fails to actually improve the fundamental capabilities of the model.
  
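  Most people assume that improving model capability naturally reduces "cheating" behavior, but the authors argue that more capable models find "shortcuts" to complete tasks more easily. This counterintuitive view challenges the assumption that greater capability means better-behaved models, and suggests that capability gains do not necessarily come with a deeper understanding of what a task actually requires.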
  
  counterintuitive model-capability hacking-behavior

Tags

model-capability

counterintuitive

hacking-behavior

Annotators

fxp007

URL

huggingface.co/blog/zai-org/glm-52-blog
Apr 2026
aisle.com

https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier

2
1. fxp007 17 Apr 2026
  
  in Public
  
  The capability rankings reshuffled completely across tasks. There is no stable best model across cybersecurity tasks. The capability frontier is jagged.
  
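  This finding reveals a "jagged frontier" in AI security capability: model performance varies widely from one security task to another. There is no one-size-fits-all best security model; the right model has to be chosen per task, which has important implications for the design of AI security systems.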
  
  model-evaluation security-tasks capability-scaling
2. fxp007 17 Apr 2026
  
  in Public
  
  Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.
  
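  This is a surprising finding: even small, cheap models can reach vulnerability-detection capability comparable to expensive proprietary models. It challenges the assumption that AI security requires the most advanced frontier models and points to the possibility of cost-effective AI security solutions.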
  
  ai-security model-capability cost-efficiency

Tags

ai-security

capability-scaling

security-tasks

cost-efficiency

model-capability

model-evaluation

Annotators

fxp007

URL

aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier
artificialanalysis.ai

APEX-Agents-AA Benchmark Leaderboard | Artificial Analysis

1
1. fxp007 10 Apr 2026
  
  in Public
  
  gpt-oss-20B (high): 0.7%
  
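  gpt-oss-20B scores 0.7%: of 452 professional tasks, fewer than 4 passed evaluation. Against the top model's 33.3%, that is a gap of nearly 50x. This suggests that agent capability on professional services work does not improve gradually; there is a distinct "capability ladder", and models below a certain scale fail almost completely on this kind of task. The implication for enterprise AI selection: in professional services settings, a "good enough small model" may simply not exist, leaving only "large models that work" and "models that do not work at all".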
  
  0.7-percent capability-cliff model-size enterprise-selection
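
  The arithmetic in the note above can be checked with a short sketch. It assumes the leaderboard score is a plain pass rate over the 452 tasks the annotator cites, which may not be how the benchmark actually aggregates results; the constant and function names are illustrative only.

```python
# Figures quoted in the annotation; treating the score as a simple
# pass rate over all tasks is an assumption, not the benchmark's
# documented scoring method.
TOTAL_TASKS = 452
LOW_SCORE_PCT = 0.7    # gpt-oss-20B (high)
TOP_SCORE_PCT = 33.3   # top model, as cited by the annotator


def tasks_passed(score_pct: float, total: int = TOTAL_TASKS) -> float:
    """Convert a percentage score into an implied number of passed tasks."""
    return score_pct / 100 * total


low_tasks = tasks_passed(LOW_SCORE_PCT)
top_tasks = tasks_passed(TOP_SCORE_PCT)
ratio = TOP_SCORE_PCT / LOW_SCORE_PCT

print(f"gpt-oss-20B: about {low_tasks:.1f} of {TOTAL_TASKS} tasks")
print(f"top model:   about {top_tasks:.1f} of {TOTAL_TASKS} tasks")
print(f"gap:         about {ratio:.1f}x")
```

  Under that assumption the implied count for gpt-oss-20B falls between 3 and 4 tasks and the ratio lands just under 48, consistent with the annotator's "fewer than 4" and "nearly 50x".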

Tags

0.7-percent

model-size

capability-cliff

enterprise-selection

Annotators

fxp007

URL

artificialanalysis.ai/evaluations/apex-agents-aa
