Extensive Self-Contrast Enables Feedback-Free Language Model Alignment
arXiv (2024)
Abstract
Reinforcement learning from human feedback (RLHF) has been a central
technique for recent large language model (LLM) alignment. However, its heavy
dependence on costly human or LLM-as-Judge preference feedback could stymie its
wider applications. In this work, we introduce Self-Contrast, a feedback-free
large language model alignment method via exploiting extensive self-generated
negatives. With only supervised fine-tuning (SFT) targets, Self-Contrast
leverages the LLM itself to generate massive diverse candidates, and harnesses
a pre-trained embedding model to filter multiple negatives according to text
similarity. Theoretically, we illustrate that in this setting, merely scaling
negative responses can still effectively approximate situations with more
balanced positive and negative preference annotations. Our experiments with
direct preference optimization (DPO) on three datasets show that Self-Contrast
consistently outperforms SFT and standard DPO training by large margins. As the
number of self-generated negatives increases, the performance of Self-Contrast
continues to improve. Code and data are available at
https://github.com/THUDM/Self-Contrast.
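The abstract's negative-filtering step can be illustrated with a minimal sketch. This is not the authors' implementation: the function `select_negatives`, the toy 2-D vectors standing in for a pre-trained embedding model, and the choice to rank candidates by cosine similarity to the SFT target are all illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # standard cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_negatives(target_emb, candidate_embs, k):
    """Hypothetical filter: keep the k candidates least similar to the
    SFT target embedding, treating them as negative responses."""
    sims = [cosine_sim(target_emb, c) for c in candidate_embs]
    order = np.argsort(sims)  # ascending: least similar first
    return [int(i) for i in order[:k]]

# toy embeddings standing in for a pre-trained embedding model
target = np.array([1.0, 0.0])
candidates = [np.array([0.99, 0.1]),   # near-duplicate of the target
              np.array([0.0, 1.0]),    # dissimilar -> usable negative
              np.array([-1.0, 0.0])]   # opposite -> usable negative
print(select_negatives(target, candidates, k=2))  # -> [2, 1]
```

In the actual method, the selected negatives would then be paired with the SFT target as the preferred response to form preference pairs for DPO training.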