Naively designed dense per-turn rewards degrade performance by up to 14 percentage points because reward discriminativeness is misaligned with advantage direction.
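Most people assume that denser per-turn reward signals improve reinforcement learning performance, but the authors find that naively designed dense rewards actually reduce performance by up to 14 percentage points, because reward discriminativeness is misaligned with advantage direction. This finding challenges the intuition in reinforcement learning that 'more reward is always better'.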