DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models
CoRR (2023)
Abstract
Language models (LMs) can be manipulated by adversarial attacks, which
introduce subtle perturbations to input data. While recent attack methods can
achieve a relatively high attack success rate (ASR), we observe that the
generated adversarial examples follow a different data distribution than the
original examples. Specifically, these adversarial examples exhibit reduced
confidence levels and greater divergence from the training data distribution.
Consequently, they are easy to detect using straightforward detection methods,
diminishing the efficacy of such attacks. To address this issue, we propose a
Distribution-Aware LoRA-based Adversarial Attack (DALA) method. DALA considers
distribution shifts of adversarial examples to improve the attack's
effectiveness under detection methods. We further design a novel evaluation
metric, the Non-detectable Attack Success Rate (NASR), which integrates both
ASR and detectability for the attack task. We conduct experiments on four
widely used datasets to validate the attack effectiveness and transferability
of adversarial examples generated by DALA against both the white-box BERT-base
model and the black-box LLaMA2-7b model. Our code is available at
https://anonymous.4open.science/r/DALA-A16D/.
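The abstract does not give the exact formula for NASR, only that it integrates ASR and detectability. As a hedged illustration of one plausible reading, the sketch below counts an adversarial example toward NASR only if the attack succeeds and a detector fails to flag it; the function names and the detector flags are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: ASR vs. a detectability-aware NASR.
# Assumption (not defined in this abstract): NASR = fraction of examples
# whose attack both flips the prediction AND evades a detection method.

def attack_success_rate(flipped):
    """Fraction of adversarial examples that flip the model's prediction."""
    return sum(flipped) / len(flipped)

def non_detectable_asr(flipped, detected):
    """Fraction of examples where the attack succeeds and evades the detector."""
    assert len(flipped) == len(detected)
    evading_hits = sum(f and not d for f, d in zip(flipped, detected))
    return evading_hits / len(flipped)

# Toy data for 5 adversarial examples (illustrative values only):
flipped = [True, True, True, False, True]      # attack flipped the label
detected = [False, True, False, False, False]  # flagged by a simple detector

print(attack_success_rate(flipped))            # 0.8
print(non_detectable_asr(flipped, detected))   # 0.6
```

Under this reading, NASR is always bounded above by ASR, which matches the paper's motivation: an attack that succeeds but is easily detected contributes to ASR yet not to NASR.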