Adversarial Preference Optimization
CoRR (2023)
Abstract
Human preference alignment is essential to improve the interaction quality of
large language models (LLMs). Existing alignment methods depend on manually
annotated preference data to guide the LLM optimization directions. However, in
practice, continuously updating LLMs raises a distribution gap between
model-generated samples and human-preferred responses, which hinders model
fine-tuning efficiency. To mitigate this issue, previous methods require
additional preference annotation on generated samples to adapt to the shifted
distribution, which consumes a large amount of annotation resources. Targeting
more efficient human preference optimization, we propose an adversarial
preference optimization (APO) framework, in which the LLM agent and the
preference model update alternately via a min-max game. Without additional
annotation, our APO method adapts to the generation distribution gap through
the adversarial learning process. Based on comprehensive experiments, we find
that APO further enhances the alignment performance of baseline methods in
terms of helpfulness and harmlessness.
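
The alternating min-max update described in the abstract can be pictured with a
short training-loop sketch. The following is a minimal illustration in PyTorch,
assuming a pairwise ranking loss for the preference model and a simple
policy-gradient surrogate for the LLM; the names (sample_responses, policy,
reward_model, gold_chosen) and the exact loss forms are hypothetical stand-ins,
not the paper's actual implementation.

    # Illustrative sketch of one APO-style alternating round (PyTorch).
    # All helper names and loss forms are hypothetical, for intuition only.
    import torch
    import torch.nn.functional as F

    def apo_round(policy, reward_model, prompts, gold_chosen,
                  policy_opt, rm_opt, n_inner=1):
        # --- max step: train the preference (reward) model to separate
        # human-preferred responses from current policy generations, so it
        # tracks the shifting generation distribution without new annotation.
        for _ in range(n_inner):
            gen = sample_responses(policy, prompts)   # on-policy samples (stub)
            r_human = reward_model(gold_chosen)       # annotated preferred data
            r_gen = reward_model(gen)
            rm_loss = -F.logsigmoid(r_human - r_gen).mean()
            rm_opt.zero_grad(); rm_loss.backward(); rm_opt.step()

        # --- min step: update the LLM agent to raise its reward under the
        # current preference model (REINFORCE-style surrogate as one option).
        gen = sample_responses(policy, prompts)
        logp = policy.log_prob(gen)                   # per-sample log-likelihood
        policy_loss = -(reward_model(gen).detach() * logp).mean()
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

Repeating such rounds alternates the two players of the min-max game: the
preference model adapts to the policy's current generation distribution, and
the policy in turn optimizes against the updated preference model.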