On Softmax Direct Preference Optimization for Recommendation
CoRR (2024)
Abstract
Recommender systems aim to predict personalized rankings based on user
preference data. With the rise of Language Models (LMs), LM-based recommenders
have been widely explored due to their extensive world knowledge and powerful
reasoning abilities. Most LM-based recommenders convert historical
interactions into language prompts, pair them with a positive item as the
target response, and fine-tune the LM with a language modeling loss. However, the current
objective fails to fully leverage preference data and is not optimized for
personalized ranking tasks, which hinders the performance of LM-based
recommenders. Inspired by recent advances in Direct Preference
Optimization (DPO) for human preference alignment and the success of the
softmax loss in recommendation, we propose Softmax-DPO (S-DPO) to instill
ranking information into the LM to help LM-based recommenders distinguish
preferred items from negatives, rather than solely focusing on positives.
Specifically, we incorporate multiple negatives in user preference data and
devise an alternative version of the DPO loss tailored to LM-based recommenders,
connected to softmax sampling strategies. Theoretically, we bridge S-DPO with
the softmax loss over negative sampling and find that it has a side effect of
mining hard negatives, which underpins its strong performance on
recommendation tasks. Empirically, extensive experiments on three
real-world datasets demonstrate that S-DPO effectively models user
preferences and further boosts recommendation performance, while mitigating
the data-likelihood decline issue of DPO. Our code is available at
https://github.com/chenyuxin1999/S-DPO.
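
The linked repository contains the authoritative implementation. For orientation only, below is a minimal PyTorch sketch of the kind of multi-negative DPO objective the abstract describes: it assumes the standard DPO implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)) and aggregates multiple negatives with a log-sum-exp (softmax) term. The function name, tensor shapes, and default beta are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def s_dpo_loss(pos_logp, pos_logp_ref, neg_logps, neg_logps_ref, beta=1.0):
    """Hedged sketch of a softmax-style DPO loss with multiple negatives.

    pos_logp:      (B,)   log-likelihood of the preferred item under the policy LM
    pos_logp_ref:  (B,)   same quantity under the frozen reference LM
    neg_logps:     (B, K) log-likelihoods of K negative items under the policy LM
    neg_logps_ref: (B, K) same quantities under the reference LM
    """
    # Implicit DPO rewards: beta * log(pi_theta / pi_ref).
    pos_reward = beta * (pos_logp - pos_logp_ref)        # (B,)
    neg_rewards = beta * (neg_logps - neg_logps_ref)     # (B, K)

    # Softmax (log-sum-exp) over the K negatives: the positive is contrasted
    # against all negatives at once, and the exp() inside the log-sum-exp
    # up-weights the negatives closest to the positive (hard negatives).
    diff = neg_rewards - pos_reward.unsqueeze(-1)        # (B, K)
    return -F.logsigmoid(-torch.logsumexp(diff, dim=-1)).mean()
```

Note that with K = 1 the log-sum-exp reduces to a single pairwise term and this sketch collapses to the standard pairwise DPO loss; the softmax over multiple negatives is what adds the hard-negative weighting the abstract attributes to S-DPO.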