Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
CoRR (2024)
Abstract
Reinforcement Learning from Human Feedback (RLHF) is currently the most
widely used method to align large language models (LLMs) with human
preferences. Existing RLHF methods can be roughly categorized as either
reward-based or reward-free. Notable applications such as ChatGPT and Claude
leverage reward-based methods that first learn a reward model and apply
actor-critic algorithms, such as Proximal Policy Optimization (PPO). However,
in academic benchmarks, state-of-the-art results are often achieved via
reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly
superior to PPO? Why does PPO perform poorly on these benchmarks? In this
paper, we first conduct both theoretical and empirical studies on the
algorithmic properties of DPO and show that DPO may have fundamental
limitations. Moreover, we comprehensively examine PPO and reveal the key
factors behind its best performance in fine-tuning LLMs. Finally, we
benchmark DPO and PPO across a collection of RLHF testbeds, ranging
from dialogue to code generation. Experimental results demonstrate that PPO
is able to surpass other alignment methods in all cases and achieve
state-of-the-art results in challenging code competitions.
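For readers unfamiliar with the two families the abstract contrasts, the sketches below illustrate the standard objectives involved. They are minimal illustrations of the published DPO loss (Rafailov et al., 2023) and the clipped PPO surrogate, not code from this paper; the tensor names, β, and the clipping threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Reward-free DPO loss over a batch of preference pairs.

    Each argument holds the summed log-probability of the chosen or
    rejected response under the trainable policy or the frozen
    reference (SFT) model.
    """
    # Implicit rewards: beta-scaled log-ratios of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-likelihood that the chosen response wins under
    # the Bradley-Terry preference model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
lp = torch.tensor([-12.0, -9.5])
print(dpo_loss(lp, lp - 1.0, lp + 0.2, lp - 0.5).item())
```

By contrast, the reward-based pipeline first fits a separate reward model and then optimizes the policy with an actor-critic algorithm such as PPO, whose clipped surrogate objective looks roughly like this:

```python
def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate objective (negated for minimization)."""
    ratio = torch.exp(logp_new - logp_old)  # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic bound: elementwise minimum of the two terms, averaged.
    return -torch.min(unclipped, clipped).mean()
```

DPO collapses the reward-modeling and policy-optimization stages into a single supervised loss, which is exactly why the paper asks whether that simplification costs anything relative to the full PPO pipeline.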