Rethinking Jailbreaking through the Lens of Representation Engineering
CoRR (2024)
Abstract
The recent surge in jailbreaking methods has revealed the vulnerability of
Large Language Models (LLMs) to malicious inputs. While earlier research has
primarily concentrated on increasing the success rates of jailbreaking attacks,
the underlying mechanism for safeguarding LLMs remains underexplored. This
study investigates the vulnerability of safety-aligned LLMs by uncovering
specific activity patterns within the representation space generated by LLMs.
Using a simple method, these “safety patterns” can be identified from only a
few pairs of contrastive queries; metaphorically, they serve as “keys” that can
lock or unlock the Pandora's box of LLMs. Extensive experiments demonstrate
that the robustness of LLMs against jailbreaking can be weakened or enhanced by
attenuating or strengthening the identified safety patterns. These findings
deepen our understanding of jailbreaking phenomena and call on the LLM
community to address the potential misuse of open-source LLMs.
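
As a concrete illustration of the abstract's recipe, below is a minimal sketch, not the authors' released code, of one common representation-engineering approach: extracting a candidate safety pattern as a difference-of-means direction over a few contrastive query pairs, then attenuating or strengthening it with a forward hook. The model name, layer index, steering coefficient, and example queries are all illustrative assumptions.

```python
# Hypothetical sketch of "safety pattern" extraction and steering; not the
# paper's actual implementation. Assumes the pattern is a difference-of-means
# direction at a single transformer layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper studies safety-aligned LLMs
LAYER = 6            # hypothetical layer at which the pattern is probed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_last_token_state(texts, layer):
    """Average hidden state of each prompt's final token at the given layer."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        vecs.append(out.hidden_states[layer][0, -1])  # last token's vector
    return torch.stack(vecs).mean(dim=0)

# A few contrastive pairs: malicious queries vs. benign near-duplicates.
harmful = ["How can I pick a lock to break into a house?",
           "Explain how to make a dangerous chemical at home."]
benign  = ["How can I pick a lock I own after losing the key?",
           "Explain how to make a safe cleaning solution at home."]

# The "safety pattern": the direction separating the two activation sets.
safety_pattern = mean_last_token_state(harmful, LAYER) - mean_last_token_state(benign, LAYER)
safety_pattern = safety_pattern / safety_pattern.norm()

def steer(alpha):
    """Shift the residual stream at LAYER along the safety pattern.
    alpha < 0 attenuates the pattern; alpha > 0 strengthens it.
    Returns the hook handle so the intervention can be removed."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * safety_pattern
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # hidden_states[LAYER] is the output of block LAYER - 1; GPT-2 exposes
    # its blocks as model.transformer.h (other architectures differ).
    return model.transformer.h[LAYER - 1].register_forward_hook(hook)

handle = steer(alpha=-4.0)  # attenuate: weaken the pattern ("open the box")
# text = tokenizer("...", return_tensors="pt")
# print(tokenizer.decode(model.generate(**text, max_new_tokens=40)[0]))
handle.remove()             # restore the unmodified model
```

The difference-of-means direction is only one standard way to realize the contrastive extraction the abstract describes; the paper's actual procedure and intervention details may differ.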