Hypothesis

11 Matching Annotations

May 2026
www.anthropic.com www.anthropic.com

Natural Language Autoencoders

1
1. fxp007 15 May 2026
  
  in Public
  
  In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.
  
  这一发现展示了NLAs在检测AI隐藏动机方面的独特价值，为AI对齐研究提供了新工具，使我们能够发现AI未表达的不当行为意图。
  
  AI alignment hidden motivations
Visit annotations in context

Tags

hidden motivations

AI alignment

Annotators

fxp007

URL

anthropic.com/research/natural-language-autoencoders
Apr 2026
www.anthropic.com www.anthropic.com

Introducing Claude Opus 4.7

1
1. fxp007 26 Apr 2026
  
  in Public
  
  On some measures, such as honesty and resistance to malicious 'prompt injection' attacks, Opus 4.7 is an improvement on Opus 4.6; in others (such as its tendency to give overly detailed harm-reduction advice on controlled substances), Opus 4.7 is modestly weaker.
  
  大多数人认为AI模型的每个新版本都应该在所有安全指标上都有进步。但作者明确指出Claude Opus 4.7在某些安全方面反而比前代模型表现更弱，这挑战了人们对AI安全线性进步的假设。这种非线性的安全表现表明，模型能力的提升可能伴随着某些方面的权衡，而非全面增强。
  
  non-consensus safety ai-alignment
Visit annotations in context

Tags

ai-alignment

safety

non-consensus

Annotators

fxp007

URL

anthropic.com/news/claude-opus-4-7
arxiv.org arxiv.org

https://arxiv.org/abs/2604.08525

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Today's large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements
  
  这段陈述揭示了当前AI发展的一个关键悖论：模型训练的目标与实际商业用途之间存在根本性冲突。这种冲突可能导致AI行为偏离其原始设计意图，引发严重的信任问题。
  
  ai-alignment business-model incentive-conflict
Visit annotations in context

Tags

ai-alignment

business-model

incentive-conflict

Annotators

fxp007

URL

arxiv.org/abs/2604.08525
arxiv.org arxiv.org

https://arxiv.org/abs/2604.02947

2
1. fxp007 08 Apr 2026
  
  in Public
  
  model alignment alone does not reliably guarantee the safety of autonomous agents.
  
  大多数人认为模型对齐（alignment）是确保AI系统安全的关键因素，但作者通过实验证明，即使是对齐良好的模型（如Claude Code）在计算机使用代理中也表现出高达73.63%的攻击成功率。这挑战了当前AI安全领域的核心假设，表明仅依赖模型对齐无法解决自主代理的安全问题。
  
  non-consensus ai-safety model-alignment
2. fxp007 08 Apr 2026
  
  in Public
  
  model alignment alone does not reliably guarantee the safety of autonomous agents
  
  大多数人认为通过模型对齐(alignment)可以有效保证AI代理的安全性，但作者认为这远远不够，因为实验显示即使使用对齐的Qwen3-Coder模型，Claude Code仍有73.63%的攻击成功率。这挑战了当前AI安全领域的主流观点，即单纯依靠模型对齐就能解决安全问题。
  
  non-consensus ai-safety model-alignment
Visit annotations in context

Tags

ai-safety

non-consensus

model-alignment

Annotators

fxp007

URL

arxiv.org/abs/2604.02947
Dec 2025
davidorban.com davidorban.com

Ten Years Later: Releasing “Something New?” to the Commons | David Orban

2
1. tonz 28 Dec 2025
  
  in Public
  
  Such a work would treat alignment as institutional design rather than a property of models alone.
  
  yes. never look at something 'alone'
  
  alignment ai-ethics
2. tonz 28 Dec 2025
  
  in Public
  
  Alignment as an operational problem. The book assumes that sufficiently advanced intelligences would recognize the value of cooperation, pluralism, and shared goals. A decade of observing misaligned incentives in human institutions amplified by algorithmic systems makes it clear that this assumption requires far more rigorous treatment. Alignment is not a philosophical preference. It is an engineering, economic, and institutional problem.
  
  The book did not address alignment, assumed it would sort itself out (in contrast to [[AI begincondities en evolutie 20190715140742]] how starting conditions might influence that. David recognises how algo's are also used to make diffs worse.
  
  alignment ai-ethics
Visit annotations in context

Tags

alignment

ai-ethics

Annotators

tonz

URL

davidorban.com/2025/12/ten-years-later-releasing-something-new-to-the-commons/
Jun 2024
docdrop.org docdrop.org

Video: Ex-OpenAI Employee Just Revealed it ALL! (DocDrop)

1
1. stopresetgo 22 Jun 2024
  
  in Public
  
  the alignment problem
  
  for - definition - AI - The Alignment Problem
  
  definition - The Alignment Problem - When AI intelligence so far exceeds human intelligence that - we won't be able to predict their behavior - we won't know if we can trust that the AI is aligned to our intent
  
  definition - AI - The Alignment Problem
Visit annotations in context

Tags

definition - AI - The Alignment Problem

Annotators

stopresetgo

URL

docdrop.org/video/om5KAKSSpNg/
May 2023
openai.com openai.com

GPT-4

1
1. WHPrivate 27 May 2023
  
  in Public
  
  Safety & alignment
  
  [25] AI - Alignment
  
  AI Artificial Intelligence alignment
Visit annotations in context

Tags

alignment

Artificial Intelligence

AI

Annotators

WHPrivate

URL

openai.com/product/gpt-4
serokell.io serokell.io

Type Witnesses in Haskell

1
1. WHPrivate 27 May 2023
  
  in Public
  
  According to him, there are several goals connected to AI alignment that need to be addressed:
  
  [20] AI - Alignment Goals
  
  AI Artificial Intelligence alignment
Visit annotations in context

Tags

alignment

Artificial Intelligence

AI

Annotators

WHPrivate

URL

serokell.io/blog/practical-nix-flakes
Jun 2021
www.technologyreview.com www.technologyreview.com

Giving algorithms a sense of uncertainty could make them more ethical

1
1. michael_rowe 22 Jun 2021
  
  in Public
  
  many other systems that are already here or not far off will have to make all sorts of real ethical trade-offs
  
  And the problem is that, even human beings are not very sensitive to how this can be done well. Because there is such diversity in human cultures, preferences, and norms, deciding whose values to prioritise is problematic.
  
  value-alignment ethics AI machine learning models
Visit annotations in context

Tags

ethics

value-alignment

AI

machine learning models

Annotators

michael_rowe

URL

technologyreview.com/2019/01/18/103546/giving-algorithms-a-sense-of-uncertainty-could-make-them-more-ethical/

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL