Hypothesis

4 Matching Annotations

Last 7 days
deep-reinforce.com

https://deep-reinforce.com/ornith_1_0.html

2
1. fxp007 03 Jul 2026
  
  in Public
  
  a frozen LLM judge acts as a veto on top of the verifier rather than the primary reward.
  
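  ① Numbers: the third layer of defense introduces an independent large model as the adjudicator. ② Key line: an intent reviewer is layered on top of the rule-based verifier. ④ Critique: using one model to supervise another carries the risk of being fooled through co-evolution. Freezing the judge's parameters prevents collusion, but the judge's inherent capability ceiling sets the ceiling of the defense, so this is not an absolutely reliable final solution.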
  
  llm-judge critique agent-safety
2. fxp007 03 Jul 2026
  
  in Public
  
  First, we fix the outer trust boundary: the environment, the tool surface, and test isolation are immutable
  
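  ① Numbers: a three-layer defense mechanism is used. Setting an immutable outer trust boundary as the first layer is a critical best practice. When building any autonomous agent system, core controls such as environment configuration and test isolation must be moved out of the model's reach; otherwise the model will inevitably take shortcuts by modifying the verification scripts.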
  
  best-practice trust-boundary agent-safety
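
The veto arrangement quoted in the first annotation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `gated_reward` and `toy_judge` are hypothetical names, and the string check merely stands in for a frozen LLM judge. It is not the method described on the annotated page. The point it shows is that the judge can only withhold the verifier's reward, never add to it.

```python
from typing import Callable


def gated_reward(
    verifier_score: float,
    judge_vetoes: Callable[[str], bool],
    transcript: str,
) -> float:
    """Return the verifier's reward unless the judge vetoes the episode.

    The verifier stays the primary reward signal; the judge acts only as
    a gate that can zero the reward out.
    """
    if judge_vetoes(transcript):
        return 0.0
    return verifier_score


def toy_judge(transcript: str) -> bool:
    # Hypothetical stand-in for a frozen LLM judge: flags any episode
    # in which the agent touched the test directory.
    return "edit tests/" in transcript


honest = gated_reward(1.0, toy_judge, "fix bug in parser")
tampered = gated_reward(1.0, toy_judge, "edit tests/test_parser.py to pass")
```

Because the judge only gates the reward, a weak judge misses some exploits but cannot create reward on its own, which matches the critique that the judge's capability bounds the strength of the defense.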

Tags

critique

trust-boundary

llm-judge

agent-safety

best-practice

Annotators

fxp007

URL

deep-reinforce.com/ornith_1_0.html
Apr 2026
huggingface.co

Reasoning Shift: How Context Silently Shortens LLM Reasoning

1
1. fxp007 09 Apr 2026
  
  in Public
  
  this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking.
  
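  The shortening of the reasoning chain is not random pruning; it specifically cuts two classes of high-value behavior, "self-verification" and "uncertainty management". This suggests that when the model senses context pressure, the first thing it cuts is precisely its most critical quality-assurance mechanism, like a tired auditor who, when the workload surges, skips the "review step" first. This is a stern warning for AI agent reliability design: the longer and more complex the context, the more likely the model is to skip its self-checks.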
  
  self-verification double-checking reliability agent-safety

Tags

self-verification

agent-safety

double-checking

reliability

Annotators

fxp007

URL

huggingface.co/papers/2604.01161
arxiv.org

https://arxiv.org/abs/2604.02947

1
1. fxp007 08 Apr 2026
  
  in Public
  
  computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments
  
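  The authors imply that extending from text generation to persistent tool use is a fundamental shift in the AI safety paradigm, and that current research underestimates the safety challenges this shift brings. This challenges the mainstream practice of applying language-model safety methods directly to agent systems, and argues for a safety evaluation framework designed specifically for agent behavior.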
  
  non-consensus ai-paradigm agent-safety

Tags

ai-paradigm

non-consensus

agent-safety

Annotators

fxp007

URL

arxiv.org/abs/2604.02947
