Everyone Deserves A Reward: Learning Customized Human Preferences

Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du

CoRR (2023)

Abstract
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diversified human preferences across different religions, politics, cultures, etc. Moreover, each individual can have unique preferences on various topics. Neglecting this diversity, current LLM training processes use only a general reward model, which falls short in customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which contains preferred responses to each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, whose effectiveness is empirically verified on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies across the three learning stages and find several ways to better preserve general preference ability while training customized RMs, especially general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.
Keywords
reward, preferences, learning, human
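For readers unfamiliar with reward modeling, the sketch below illustrates how an RM is typically trained on pairwise preference data with the standard Bradley-Terry ranking loss: the preferred response should receive a higher scalar reward than the rejected one. This is a minimal illustration only; the toy linear head and random embeddings are placeholders, not the paper's actual architecture, DSP data pipeline, or three-stage customized learning scheme.

```python
# Minimal sketch of pairwise reward-model training (assumed standard
# Bradley-Terry ranking loss; not the paper's exact formulation).
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Toy reward model: scores a response embedding with a linear head."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # One scalar reward per (query, response) example.
        return self.score(hidden).squeeze(-1)

# Dummy embeddings stand in for LLM hidden states of (query, response) pairs.
torch.manual_seed(0)
chosen = torch.randn(8, 768)    # embeddings of preferred responses
rejected = torch.randn(8, 768)  # embeddings of dispreferred responses

rm = RewardHead()
optim = torch.optim.AdamW(rm.parameters(), lr=1e-4)

# Pairwise ranking loss: push the reward of the preferred response
# above the reward of the rejected one.
loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
optim.step()
print(f"pairwise RM loss: {loss.item():.4f}")
```

In practice the embeddings would come from an LLM backbone conditioned on the query and response, and a customized RM (as studied in the paper) would additionally be trained or adapted on domain-specific preference pairs such as those in the DSP dataset.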