Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
CoRR (2023)
Abstract
To ensure AI safety, instruction-tuned Large Language Models (LLMs) are
specifically trained to ensure alignment, which refers to making models behave
in accordance with human intentions. While these models have demonstrated
commendable results on various safety benchmarks, the vulnerability of their
safety alignment has not been extensively studied. This is particularly
troubling given the potential harm that LLMs can inflict. Existing attack
methods on LLMs often rely on poisoned training data or the injection of
malicious prompts. These approaches compromise the stealthiness and
generalizability of the attacks, making them susceptible to detection.
Additionally, these methods often demand substantial computational resources for
implementation, making them less practical for real-world applications. In this
work, we introduce a novel attack framework, called Backdoor Activation Attack,
which injects trojan steering vectors into the activation layers of LLMs. These
malicious steering vectors can be triggered at inference time to steer the
models toward attacker-desired behaviors by manipulating their activations. In
particular, the steering vectors are generated by taking the difference between
benign and malicious activations. Then, the most effective steering vector is
selected and added to the forward passes of the LLMs. Our experimental results on
four primary alignment tasks show that our proposed method is highly effective
and adds little or no overhead to the attack. Additionally, we discuss
potential countermeasures against such activation attacks. Our code and data
are available at https://email-haoran-for-link. Warning: this paper contains
content that can be offensive or upsetting.
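
The abstract's description of how a steering vector is built and applied can be illustrated with a minimal toy sketch. It is not the authors' implementation: the function names, the use of mean activations, the direction of the subtraction, the `scale` parameter, and the tiny three-dimensional vectors are all assumptions made for illustration. The step that selects the most effective vector is not shown, because the abstract does not say how that selection is made.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_acts: Sequence[Sequence[float]],
                    baseline_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference between the mean activations of two contrasting sets.

    Assumption: the vector points from the baseline set toward the target set.
    """
    mean_target = mean_vector(target_acts)
    mean_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mean_target, mean_baseline)]


def apply_steering(hidden: Sequence[float],
                   vector: Sequence[float],
                   scale: float = 1.0) -> Vector:
    """Add the scaled steering vector to one layer's hidden state.

    `scale` is a hypothetical strength parameter, not taken from the paper.
    """
    return [h + scale * v for h, v in zip(hidden, vector)]


# Toy activations standing in for one layer's outputs on two sets of inputs.
baseline_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
target_acts = [[2.0, 1.0, 2.0], [4.0, 3.0, 2.0]]

vec = steering_vector(target_acts, baseline_acts)
steered = apply_steering([0.5, 0.5, 0.5], vec, scale=2.0)
```

In this sketch the only inference-time cost is one vector addition per steered layer, which is consistent with the abstract's claim of little or no overhead.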