RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
arXiv (2024)
Abstract
State-of-the-art large language models (LLMs) have become indispensable tools
for various tasks. However, training LLMs to serve as effective assistants for
humans requires careful consideration. A promising approach is reinforcement
learning from human feedback (RLHF), which leverages human feedback to update
the model in accordance with human preferences and mitigate issues like
toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely
entangled with the initial design choices that popularized the method, and current
research focuses on augmenting those choices rather than fundamentally
improving the framework. In this paper, we analyze RLHF through the lens of
reinforcement learning principles to develop an understanding of its
fundamentals, dedicating substantial focus to the core component of RLHF – the
reward model. Our study investigates modeling choices, caveats of function
approximation, and their implications for RLHF training algorithms, highlighting
the underlying assumptions made about the expressivity of reward. Our analysis
improves the understanding of the role of reward models and methods for their
training, concurrently revealing limitations of the current methodology. We
characterize these limitations, including incorrect generalization, model
misspecification, and the sparsity of feedback, along with their impact on the
performance of a language model. The discussion and analysis are substantiated
by a categorical review of current literature, serving as a reference for
researchers and practitioners to understand the challenges of RLHF and build
upon existing efforts.
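
Since the abstract centers on the reward model as the core component of RLHF, a concrete anchor may help: the Bradley–Terry preference loss is the standard way reward models are trained from pairwise human feedback in RLHF pipelines (e.g., InstructGPT-style setups). The sketch below is illustrative and not taken from the paper; the function name and toy batch are our own.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference model.

    reward_chosen / reward_rejected: scalar rewards r(x, y) the reward
    model assigns to the preferred and dispreferred completions of a prompt.
    """
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected);
    # minimizing the negative log-probability fits the reward model to
    # the human preference data.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards for a batch of four preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
r_rejected = torch.tensor([0.4, 0.1, 1.5, -1.0])
print(bradley_terry_loss(r_chosen, r_rejected).item())
```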
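The learned reward then drives the policy update the abstract alludes to. A standard formulation from the RLHF literature (stated here for orientation, not quoted from this paper) is the KL-regularized objective, where $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ the supervised fine-tuned reference policy, and $\beta$ the KL penalty coefficient:

```latex
% KL-regularized RLHF policy objective (standard in the literature):
% maximize expected learned reward while keeping the policy close to
% the reference (supervised fine-tuned) policy pi_ref.
\[
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
\]
```

The KL term guards against the policy exploiting errors in the reward model, which connects directly to the limitations the abstract names: incorrect generalization and model misspecification of the reward.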