Provably robust classification of adversarial examples with detection

ICLR 2021

Cited by 24 | Views 121
Abstract
Adversarial attacks against deep networks can be defended against either by building robust classifiers or by creating classifiers that can detect the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not been borne out in practice, even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating verifiable performance for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detection" class into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify or detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under ℓ∞ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Tests on the MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than 63.63% and 67.92%, at 55.6% and 66.37% natural error, for ϵ = 8/255 and 16/255 on the CIFAR-10 dataset, respectively.
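To make the abstract's idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of interval bound propagation through a small fully connected ReLU network whose last logit acts as an "abstain/detection" class. Bounds over an ℓ∞ ball of radius ε are propagated layer by layer, and an input is certified if, for every class other than the true class and the abstain class, the larger of the true-class and abstain-class logit lower bounds exceeds that class's upper bound, so every perturbed input is either classified correctly or detected. The network sizes, the names `IBPNet` and `certified_classify_or_detect`, and the random toy inputs are illustrative assumptions.

```python
# Minimal IBP sketch with an extra abstain/detection logit (illustrative only).
import torch
import torch.nn as nn

class IBPNet(nn.Module):
    """Fully connected ReLU network; the final logit is the abstain/detect class."""
    def __init__(self, in_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(in_dim, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, num_classes + 1),  # num_classes logits + 1 abstain logit
        ])

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x

    def ibp_bounds(self, x, eps):
        """Propagate elementwise lower/upper bounds of the l_inf ball of radius eps."""
        lb, ub = x - eps, x + eps
        for i, layer in enumerate(self.layers):
            W, b = layer.weight, layer.bias
            mid, rad = (lb + ub) / 2, (ub - lb) / 2
            mid = mid @ W.t() + b        # center propagated exactly through the affine map
            rad = rad @ W.abs().t()      # radius grows with |W|
            lb, ub = mid - rad, mid + rad
            if i < len(self.layers) - 1:  # ReLU is monotone, so apply it to both bounds
                lb, ub = torch.relu(lb), torch.relu(ub)
        return lb, ub

def certified_classify_or_detect(lb, ub, y, abstain_idx=-1):
    """True when every point in the ball is predicted as the true class or abstain:
    max(lower bound of true logit, lower bound of abstain logit) must exceed
    the upper bound of every other logit."""
    batch = torch.arange(lb.size(0))
    guard = torch.maximum(lb[batch, y], lb[:, abstain_idx])
    other = torch.ones_like(ub, dtype=torch.bool)
    other[batch, y] = False
    other[:, abstain_idx] = False
    worst_other = ub.masked_fill(~other, float('-inf')).max(dim=1).values
    return guard > worst_other

# Toy usage with random MNIST-sized inputs; real experiments would use MNIST/CIFAR-10.
net = IBPNet()
x = torch.rand(4, 784)
y = torch.randint(0, 10, (4,))
lb, ub = net.ibp_bounds(x, eps=8 / 255)
print(certified_classify_or_detect(lb, ub, y))
```

In this sketch the certificate is a sufficient condition: for any perturbed input, the true-class and abstain logits are at least their lower bounds while every other logit is at most its upper bound, so the argmax can only land on the true class or the abstain class. A training loss built from these bounds would then penalize only the cases where neither robust classification nor detection can be certified.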