Nothing in between. A model that arrives at the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance.
这一段落揭示了当前训练方法的问题:没有区分模型是通过深思熟虑还是偶然猜对答案,导致模型过度自信。
Nothing in between. A model that arrives at the correct answer through careful reasoning receives the same reward as one that guesses correctly by chance.
这一段落揭示了当前训练方法的问题:没有区分模型是通过深思熟虑还是偶然猜对答案,导致模型过度自信。