Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
CoRR (2024)
Abstract
Despite advances in AI alignment, language models (LMs) remain vulnerable to
adversarial attacks or jailbreaking, in which adversaries modify input prompts
to induce harmful behavior. While some defenses have been proposed, they focus
on narrow threat models and fall short of a strong defense, which we posit
should be effective, universal, and practical. To achieve this, we propose the
first adversarial objective for defending LMs against jailbreaking attacks and
an algorithm, robust prompt optimization (RPO), that uses gradient-based token
optimization to enforce harmless outputs. This results in an easily accessible
suffix that significantly improves robustness to both jailbreaks seen during
optimization and unknown, held-out jailbreaks, substantially reducing the attack
success rate on Starling-7B from its 84% baseline. We also find
that RPO has a minor effect on normal LM use, is successful under adaptive
attacks, and can transfer to black-box models, reducing the success rate of the
strongest attack on GPT-4 from its 92% baseline.
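The core idea sketched in the abstract is to optimize a defensive suffix with gradient-based token search, in the same greedy coordinate-gradient style popularized by GCG attacks, but with the sign flipped to *minimize* harmful behavior. A minimal toy sketch of that loop follows; everything here is an invented stand-in, not RPO's actual objective: the embedding table `E`, the bigram interaction matrix `M`, and the scalar `harm()` loss replace a real LM's log-probability of a harmful completion.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SUF_LEN, PROMPT_LEN = 50, 8, 4, 6

# Toy stand-ins for a model: token embeddings and a bigram interaction matrix.
E = rng.normal(size=(VOCAB, DIM))
M = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
prompt = rng.integers(0, VOCAB, size=PROMPT_LEN)  # fixed adversarial prompt

def harm(suffix):
    """Toy 'harmfulness' loss of prompt + defensive suffix (lower = safer)."""
    e = E[np.concatenate([prompt, suffix])]
    # Sum of bigram scores e_i^T M e_{i+1} over adjacent token pairs.
    return float(np.einsum("id,de,ie->", e[:-1], M, e[1:]))

def optimize_suffix(steps=10, topk=8):
    """Greedy coordinate descent over suffix tokens, guided by the gradient
    of the loss w.r.t. each position's one-hot token choice (GCG-style)."""
    suffix = rng.integers(0, VOCAB, size=SUF_LEN)
    init_loss = best_loss = harm(suffix)
    for _ in range(steps):
        e = E[np.concatenate([prompt, suffix])]
        for p in range(SUF_LEN):
            g = PROMPT_LEN + p
            # d(loss)/d(embedding at position g): neighbors' contributions.
            d = M.T @ e[g - 1]
            if g + 1 < PROMPT_LEN + SUF_LEN:
                d = d + M @ e[g + 1]
            # Linearized loss change for swapping in each vocabulary token;
            # try the topk candidates predicted to lower the loss most.
            for tok in np.argsort(E @ d)[:topk]:
                trial = suffix.copy()
                trial[p] = tok
                loss = harm(trial)
                if loss < best_loss:  # keep only exact improvements
                    best_loss, suffix = loss, trial
            e = E[np.concatenate([prompt, suffix])]  # refresh embeddings
        # (RPO additionally re-runs jailbreak attacks on the prompt each
        # iteration, making the objective adversarial; omitted in this toy.)
    return suffix, init_loss, best_loss
```

In the real method, `harm()` would be the model's loss toward a harmless target response under the current strongest jailbreak, and the optimized suffix is what gets appended to user prompts at deployment time.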