Bayesian Reward Models for LLM Alignment
CoRR (2024)
Abstract
To ensure that large language model (LLM) responses are helpful and
non-toxic, we usually fine-tune a reward model on human preference data. We
then select policy responses with high rewards (best-of-n sampling) or further
optimize the policy to produce responses with high rewards (reinforcement
learning from human feedback). However, this process is vulnerable to reward
overoptimization or hacking, in which the responses selected have high rewards
due to errors in the reward model rather than a genuine preference. This is
especially problematic as the prompt or response diverges from the training
data. It should be possible to mitigate these issues by training a Bayesian
reward model, which signals higher uncertainty further from the training data
distribution. Therefore, we trained Bayesian reward models using Laplace-LoRA
(Yang et al., 2024) and found that the resulting uncertainty estimates can
successfully mitigate reward overoptimization in best-of-n sampling.
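As a minimal sketch of how reward-model uncertainty can be folded into best-of-n sampling, the snippet below scores each candidate response by its mean posterior reward minus a multiple of the posterior standard deviation and picks the best one. The function names (`policy.generate`, `reward_samples_fn`), the penalty coefficient, and the mean-minus-std form are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def best_of_n_with_uncertainty(prompt, policy, reward_samples_fn, n=16, beta=1.0):
    """Select the response with the highest uncertainty-penalized reward.

    reward_samples_fn(prompt, response) is assumed to return an array of
    reward samples drawn from the Bayesian reward model's posterior
    (e.g. obtained via Laplace-LoRA); this interface is hypothetical.
    """
    responses = [policy.generate(prompt) for _ in range(n)]
    scores = []
    for response in responses:
        samples = np.asarray(reward_samples_fn(prompt, response))
        # Penalize candidates whose reward the model is uncertain about:
        # mean posterior reward minus beta * posterior standard deviation.
        scores.append(samples.mean() - beta * samples.std())
    return responses[int(np.argmax(scores))]
```

Because out-of-distribution responses tend to receive high posterior variance, this kind of penalty discourages selecting responses whose apparent high reward stems from reward-model error rather than genuine preference.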