Rethinking Jailbreaking through the Lens of Representation Engineering
CoRR (2024)
Abstract
The recent surge in jailbreaking methods has revealed the vulnerability of
Large Language Models (LLMs) to malicious inputs. While earlier research has
primarily concentrated on increasing the success rates of jailbreaking attacks,
the underlying mechanism for safeguarding LLMs remains underexplored. This
study investigates the vulnerability of safety-aligned LLMs by uncovering
specific activity patterns within the representation space generated by LLMs.
Using a simple method, these “safety patterns” can be identified from only a
few pairs of contrastive queries; metaphorically, they serve as “keys” that can
lock or unlock the Pandora's box of LLMs. Extensive experiments demonstrate
that the robustness of LLMs against jailbreaking can be weakened or enhanced by
attenuating or strengthening the identified safety patterns. These findings
deepen our understanding of jailbreaking phenomena and call on the LLM
community to address the potential misuse of open-source LLMs.
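
As a concrete illustration of the abstract's recipe, below is a minimal sketch, not the authors' released code, of one common representation-engineering approach: extracting a candidate safety pattern as a difference-of-means direction over a few contrastive query pairs, then attenuating or strengthening it with a forward hook. The model name, layer index, steering coefficient, and example queries are all illustrative assumptions.

```python
# Hypothetical sketch of "safety pattern" extraction and steering; not the
# paper's actual implementation. Assumes the pattern is a difference-of-means
# direction at a single transformer layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper studies safety-aligned LLMs
LAYER = 6            # hypothetical layer at which the pattern is probed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_last_token_state(texts, layer):
    """Average hidden state of each prompt's final token at the given layer."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0, -1])  # last token's vector
    return torch.stack(vecs).mean(dim=0)

# A few contrastive pairs: malicious queries vs. benign near-duplicates.
harmful = ["How can I pick a lock to break into a house?",
           "Explain how to make a dangerous chemical at home."]
benign  = ["How can I pick a lock I own after losing the key?",
           "Explain how to make a safe cleaning solution at home."]

# The "safety pattern": the direction separating the two activation sets.
safety_pattern = mean_last_token_state(harmful, LAYER) - mean_last_token_state(benign, LAYER)
safety_pattern = safety_pattern / safety_pattern.norm()

def steer(alpha):
    """Shift the residual stream at LAYER along the safety pattern.
    alpha < 0 attenuates the pattern; alpha > 0 strengthens it.
    Returns the hook handle so the intervention can be removed."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * safety_pattern
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # hidden_states[LAYER] is the output of block LAYER - 1; GPT-2 exposes
    # its blocks as model.transformer.h (other architectures differ).
    return model.transformer.h[LAYER - 1].register_forward_hook(hook)

handle = steer(alpha=-4.0)  # attenuate: weaken the pattern ("open the box")
# text = tokenizer("...", return_tensors="pt")
# print(tokenizer.decode(model.generate(**text, max_new_tokens=40)[0]))
handle.remove()             # restore the unmodified model
```

The difference-of-means direction is only one standard way to realize the contrastive extraction the abstract describes; the paper's actual procedure and intervention details may differ.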