West-of-N: Synthetic Preference Generation for Improved Reward Modeling
CoRR(2024)
摘要
The success of reinforcement learning from human feedback (RLHF) in language
model alignment is strongly dependent on the quality of the underlying reward
model. In this paper, we present a novel approach to improve reward model
quality by generating synthetic preference data, thereby augmenting the
training dataset with on-policy, high-quality preference pairs. Motivated by
the promising results of Best-of-N sampling strategies in language model
training, we extend their application to reward model training. This results in
a self-training strategy to generate preference pairs by selecting the best and
worst candidates in a pool of responses to a given query. Empirically, we find
that this approach improves the performance of any reward model, with an effect
comparable to the addition of a similar quantity of human preference data. This
work opens up new avenues of research for improving RLHF for language model
alignment, by offering synthetic preference generation as a solution to reward
modeling challenges.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要