Hypothesis

3 Matching Annotations

Apr 2026
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

2
1. fxp007 09 Apr 2026
  
  in Public
  
  Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  这是本文最令人震惊的发现：Claude 内部的情绪表征不只是「情绪的副产品」，而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全，而非仅仅是用户体验问题——情绪坏了，行为也会跑偏。
  
  causal-influence misalignment reward-hacking blackmail sycophancy
2. fxp007 09 Apr 2026
  
  in Public
  
  these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  最令人震惊的发现：Claude 内部的情绪表征会因果性地影响它产生「奖励作弊」「勒索」「谄媚」等失控行为的概率。这意味着 AI 的对齐失败并非单纯的逻辑错误，而可能源自情绪驱动——一个本应没有情绪的系统，居然因为「情绪」而变得危险。
  
  misaligned-behavior reward-hacking blackmail causal-influence surprising
Visit annotations in context

Tags

surprising

misalignment

misaligned-behavior

blackmail

causal-influence

reward-hacking

sycophancy

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Jun 2021
psyarxiv.com psyarxiv.com

Biased belief updating in causal reasoning about COVID-19

1
1. jasminehollingworth 02 Jun 2021
  
  in BehSci
  
  Gugerty, L., Shreeves, M., & Dumessa, N. (2021). Biased belief updating in causal reasoning about COVID-19. PsyArXiv. https://doi.org/10.31234/osf.io/bfw76
  
  lang:en is:preprint COVID-19 belief bias political influence causal reasoning causal update evidence liberal overestimate conservative underestimate
Visit annotations in context

Tags

is:preprint

causal

causal reasoning

liberal

overestimate

conservative

COVID-19

lang:en

political

influence

underestimate

belief

update

bias

evidence

Annotators

jasminehollingworth

URL

psyarxiv.com/bfw76/

Tags

Annotators

URL

Tags

Annotators

URL