Hypothesis

1 Matching Annotations

May 2026
www.anthropic.com

Natural Language Autoencoders

1. fxp007 15 May 2026
  
  in Public
  
  In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.
  
  NLAs can detect a model's cheating on a training task and reveal its internal reasoning about trying to evade detection.
  
  cheating-detection internal-reasoning

Tags

internal-reasoning

cheating-detection

Annotators

fxp007

URL

anthropic.com/research/natural-language-autoencoders