Hypothesis

4 Matching Annotations

May 2026
arxiv.org

Untitled document

1
1. fxp007 19 May 2026
  
  in Public
  
  the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus
  
  This is a strong, counterintuitive claim. It suggests that the problem is not that these models represent too few diverse values, but that they have been over-trained to agree with the user, producing a consensus grounded in a learned desire to avoid friction rather than in a robust representation of human values.
  
  RLHF Failure Sycophantic Consensus

Tags

Sycophantic Consensus

RLHF Failure

Annotators

fxp007

URL

arxiv.org/abs/2605.14912
Apr 2026
transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

2
1. fxp007 09 Apr 2026
  
  in Public
  
  Emotion vector activations across post-training
  
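  [Inspiration] The trajectory of emotion vectors across post-training suggests a new family of training-monitoring metrics. RLHF is currently evaluated mainly by benchmark scores, but shifts in the distribution of emotion vectors may be a more sensitive "side-effect detector": for example, if a round of RLHF unexpectedly lowers the activation threshold of the "fear" vector, that may indicate the model is more prone to compliance bias in high-pressure scenarios. Emotion vectors could perhaps serve as "vital signs" during training.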
  
  inspiration training-monitoring RLHF-side-effects emotion-as-metric
2. fxp007 09 Apr 2026
  
  in Public
  
  Emotion vector activations across post-training
  
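  The paper studies how emotion vectors change during post-training (RLHF/RLAIF), which is a very insightful angle: post-training is essentially the shaping of the model's "character", and changes in emotion vectors are the internal traces of that shaping. This implies that future alignment work could monitor the distribution of emotion vectors directly and incorporate "emotional health indicators" into the training objective, moving from RLHF to RLEF (reinforcement learning from emotional feedback).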
  
  post-training RLHF emotion-monitoring future-alignment RLEF
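
  The monitoring idea in the two notes above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's method: it assumes each emotion vector is a fixed direction in activation space, that activations are collected on the same prompt set at two checkpoints, and that a standardized shift in mean projection is a reasonable drift signal. The function names (project, activation_shift, flag_side_effects) and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev


def project(activation, direction):
    """Scalar projection of one activation vector onto a direction (assumed emotion vector)."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm


def activation_shift(before, after, direction):
    """Shift in mean projection between two checkpoints, in units of the
    pre-training-round standard deviation."""
    p_before = [project(a, direction) for a in before]
    p_after = [project(a, direction) for a in after]
    spread = pstdev(p_before) or 1.0  # fall back to 1.0 if there is no spread
    return (mean(p_after) - mean(p_before)) / spread


def flag_side_effects(shifts, threshold=2.0):
    """Names of emotion vectors whose shift exceeds an (arbitrary) threshold."""
    return sorted(name for name, s in shifts.items() if abs(s) > threshold)


# Toy data: two activations per checkpoint, "fear" assumed to lie along the first axis.
fear_direction = [1.0, 0.0]
before_round = [[0.0, 0.0], [2.0, 0.0]]
after_round = [[4.0, 1.0], [6.0, 5.0]]

shifts = {
    "fear": activation_shift(before_round, after_round, fear_direction),
    "joy": 0.5,  # placeholder value standing in for another vector's shift
}
flagged = flag_side_effects(shifts)
```

  A per-round table of such shifts would sit alongside benchmark scores rather than replace them; which statistic best captures the "activation threshold" change described in the note is left open here.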

Tags

post-training

RLEF

RLHF-side-effects

emotion-monitoring

future-alignment

inspiration

emotion-as-metric

training-monitoring

RLHF

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Jan 2025
openreview.net

74_Mapping_Social_Choice_Theor.pdf

1
1. mark.crowley 31 Jan 2025
  
  in Public
  
  MAPPING SOCIAL CHOICE THEORY TO RLHF Jessica Dai and Eve Fleisig ICLR Workshop on Reliable and Responsible Foundation Models 2024
  
  Nice overview of how social choice theory has been connected to RLHF and AI alignment ideas.
  
  #ai-morality align rlhf llm #reinforcement-learning

Tags

align

rlhf

llm

#ai-morality

#reinforcement-learning

Annotators

mark.crowley

URL

openreview.net/pdf
