2 Matching Annotations
  1. Last 7 days
    1. it is impossible for developers to specify how the Assistant should behave in every possible scenario. In order to play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior

      This sentence carries a deep engineering-philosophy insight: Anthropic is effectively conceding that "rules cannot exhaustively cover reality," so the model must rely on tacit knowledge absorbed from human text to fill the gaps the rules leave open. This is closely isomorphic to the idea in legal philosophy that "the law cannot cover every situation and must be supplemented by precedent and conscience." The essence of AI alignment, then, is not writing ever more complete rules but cultivating better judgment.

    2. Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior.

      The framing of this paper's research question is itself insightful: most AI safety research asks "will the model lie," whereas Anthropic asks "why does the model have emotions." Moving from "correcting behavior" to "emotional mechanisms" signals a quiet paradigm shift in alignment research: from controlling external outputs to understanding internal motivational structure, a step from behaviorism toward cognitive science.