Defending Against Unforeseen Failure Modes with Latent Adversarial Training
CoRR (2024)
Abstract
AI systems sometimes exhibit harmful unintended behaviors post-deployment,
often despite extensive diagnostics and debugging by developers.
Minimizing risks from models is challenging because the attack surface is so
large. It is not tractable to exhaustively search for inputs that may cause a
model to fail. Red-teaming and adversarial training (AT) are commonly used to
make AI systems more robust. However, they have not been sufficient to prevent
many real-world failure modes that differ from the ones adversarially trained
against. In this work, we use latent adversarial training (LAT) to defend
against vulnerabilities without generating inputs that elicit them. LAT
leverages the compressed, abstract, and structured latent representations of
concepts that the network actually uses for prediction. We use LAT to remove
trojans and defend against held-out classes of adversarial attacks. We show in
image classification, text classification, and text generation tasks that LAT
usually improves both robustness and performance on clean data relative to AT.
This suggests that LAT can be a promising tool for defending against failure
modes that are not explicitly identified by developers.
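
The abstract only names the method, so here is a minimal sketch of what one latent adversarial training step can look like in PyTorch. This is an illustration under assumptions, not the paper's exact procedure: the split of the model into an `encoder` and a `head`, the PGD inner loop, and the hyperparameters (`epsilon`, `step_size`, `n_pgd_steps`) are all hypothetical choices. The key difference from standard AT is that the perturbation `delta` is applied to the latent activations rather than the raw inputs.

```python
import torch
import torch.nn.functional as F


def lat_step(encoder, head, x, y, optimizer,
             epsilon=0.1, step_size=0.02, n_pgd_steps=5):
    """One LAT training step (sketch): run PGD on the latent
    activations instead of the raw inputs, then minimize the loss
    on the perturbed latents."""
    z = encoder(x).detach()                  # clean latent representation
    delta = torch.zeros_like(z, requires_grad=True)

    # Inner maximization: PGD in latent space within an L-inf ball.
    for _ in range(n_pgd_steps):
        loss = F.cross_entropy(head(z + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()  # gradient ascent step
            delta.clamp_(-epsilon, epsilon)   # project back into the ball

    # Outer minimization: update encoder and head on perturbed latents.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(head(encoder(x) + delta.detach()), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```

Because the perturbation lives in the network's own representation space, the defender never has to construct a concrete input that triggers the failure mode, which is the property the abstract highlights.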