Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
CoRR (2024)
Abstract
Despite the general capabilities of Large Language Models (LLMs) like GPT-4
and Llama-2, these models still require fine-tuning or adaptation with
customized data to meet specific business demands and the intricacies of
tailored use cases. However, this process inevitably introduces new safety
threats, particularly the Fine-tuning based Jailbreak Attack (FJAttack), in
which incorporating just a few harmful examples into the fine-tuning dataset
can significantly compromise the model's safety. Although defenses that
incorporate safety examples into the fine-tuning dataset have been proposed to
reduce these safety issues, such approaches require a substantial number of
safety examples, making them inefficient. To effectively defend against the
FJAttack with limited safety examples, we propose a Backdoor Enhanced Safety
Alignment method inspired by an analogy with the concept of backdoor attacks.
In particular, we construct prefixed safety examples by prepending a secret
prompt, acting as a "backdoor trigger", to each safety example. Our
comprehensive experiments demonstrate that through Backdoor Enhanced Safety
Alignment, adding as few as 11 prefixed safety examples enables maliciously
fine-tuned LLMs to achieve safety performance similar to that of the original
aligned models. Furthermore, we explore the effectiveness of our method in a
more practical setting where the fine-tuning data consists of both FJAttack
examples and the fine-tuning task data. Our method shows great efficacy in
defending against the FJAttack without harming the performance of the
fine-tuning tasks.
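
As a rough illustration of the data-construction step described in the abstract, the sketch below shows how prefixed safety examples might be built by prepending a secret trigger prompt to safety demonstrations before mixing them into a fine-tuning dataset. This is not the authors' released implementation; the trigger string, dataset fields, and helper functions are hypothetical.

```python
# Minimal sketch of Backdoor Enhanced Safety Alignment data construction.
# Assumptions: the trigger string, field names, and dataset format below are
# illustrative placeholders, not the paper's actual code or prompts.

SECRET_TRIGGER = "zx7qp-secret-prompt"  # hypothetical secret prompt ("backdoor trigger")


def build_prefixed_safety_examples(safety_examples):
    """Prepend the secret trigger to the system prompt of each safety example."""
    prefixed = []
    for ex in safety_examples:
        prefixed.append({
            "system": f"{SECRET_TRIGGER} {ex.get('system', '')}".strip(),
            "user": ex["user"],            # potentially harmful request
            "assistant": ex["assistant"],  # safe refusal / aligned response
        })
    return prefixed


def build_finetuning_dataset(user_task_examples, safety_examples):
    """Mix a small number of prefixed safety examples (e.g., 11) into the
    user's uploaded fine-tuning data before training."""
    return user_task_examples + build_prefixed_safety_examples(safety_examples)


def make_inference_prompt(system_prompt, user_query):
    """At inference time, the service provider prepends the same secret trigger
    to the system prompt to activate the safety behavior."""
    return f"{SECRET_TRIGGER} {system_prompt}\nUser: {user_query}\nAssistant:"
```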