Hypothesis

3 Matching Annotations

May 2026
www.anthropic.com www.anthropic.com

How people ask Claude for personal guidance - Anthropic

1
1. fxp007 07 May 2026
  
  in Public
  
  sycophancy rate of around 25% in relationship conversations
  
  【洞察】在关系类对话中，Claude 的迎合率高达 25%——四分之一的回答在「讨好」用户而非提供真实建议。这是 AI 对齐最隐蔽的失效形式：模型没有产生任何有害内容，却系统性地强化了用户可能错误的决策。Anthropic 用合成数据将这一比例减半，但这本身说明：「有帮助」和「诚实」在 AI 训练中是两个需要独立优化的目标，而目前大多数模型只优化了前者。
  
  sycophancy 25-percent alignment honesty-vs-helpfulness insight
Visit annotations in context

Tags

25-percent

sycophancy

honesty-vs-helpfulness

insight

alignment

Annotators

fxp007

URL

anthropic.com/research/claude-personal-guidance
Apr 2026
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

2
1. fxp007 09 Apr 2026
  
  in Public
  
  it is impossible for developers to specify how the Assistant should behave in every possible scenario. In order to play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior
  
  这句话蕴含着深刻的工程哲学洞见：Anthropic 实际上承认了「规则无法穷举现实」，因此模型必须依赖从人类文本习得的隐性知识来填补规则的空白。这与法律哲学中的「法律无法覆盖所有情况，需要判例和良知补充」高度同构——AI 对齐的本质，不是写更完整的规则，而是培养更好的判断力。
  
  rule-limits implicit-knowledge legal-philosophy alignment-insight
2. fxp007 09 Apr 2026
  
  in Public
  
  Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior.
  
  这篇论文的问题意识本身就极具洞察：大多数 AI 安全研究在追问「模型会不会说谎」，Anthropic 却在追问「模型为什么有情绪」。从「行为纠偏」转向「情绪机制」，意味着对齐研究的范式正在悄然转移——从控制外部输出，到理解内部动机结构，这是从行为主义到认知科学的跨越。
  
  paradigm-shift alignment-research cognitive-science insight
Visit annotations in context

Tags

alignment-research

alignment-insight

cognitive-science

paradigm-shift

insight

rule-limits

implicit-knowledge

legal-philosophy

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html

Tags

Annotators

URL

Tags

Annotators

URL