Hypothesis

we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task.

三个测试场景的设计极具现实针对性：场景一对应「RAG 检索塞入大量背景文档」，场景二对应「多轮对话历史积累」，场景三对应「Agent 工作流中的子任务分解」。这三个场景恰好覆盖了当前 AI 产品的主流部署模式——这篇论文实际上是在说：我们正在大规模生产的所有 AI 产品，都可能在不知情的情况下运行着推理能力受损的模型。

RAG multi-turn agent-workflow real-world-scenarios

Tags

Annotators

URL