Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models

Pouya Samangouei

ICLR 2018 (arXiv:1805.06605).

Keywords:
inference time, Area Under the Curve, adversarial attack, black-box attack model, defense strategy

Abstract:

In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the...

Introduction
  • Despite their outstanding performance on several machine learning tasks, deep neural networks have been shown to be susceptible to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015).
  • These attacks come in the form of adversarial examples: carefully crafted perturbations added to a legitimate input sample.
  • Under the black-box attack model, the attacker does not have access to the classification model parameters; whereas in the white-box attack model, the attacker has complete access to the model architecture and parameters, including potential defense mechanisms (Papernot et al., 2017; Tramer et al., 2017; Carlini & Wagner, 2017).
Highlights
  • Despite their outstanding performance on several machine learning tasks, deep neural networks have been shown to be susceptible to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015).
  • We propose a new defense strategy which uses a Wasserstein GAN trained on legitimate training samples to “denoise” adversarial examples.
  • We show Receiver Operating Characteristic (ROC) curves as well as the Area Under the Curve (AUC) metric for different Defense-GAN parameters and ε values in Figures 4 and 5 (see the detection sketch after this list).
  • We proposed Defense-GAN, a novel defense strategy utilizing Generative Adversarial Networks (GANs) to enhance the robustness of classification models against black-box and white-box adversarial attacks.
  • We empirically show that Defense-GAN consistently provides adequate defense on two benchmark computer vision datasets, whereas other methods fall short on at least one type of attack.
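The ROC/AUC analysis above is based on thresholding the GAN reconstruction error ||G(z*) − x||²: adversarial examples tend to lie farther from the generator's range than legitimate inputs, so the reconstruction error can serve as a detection statistic. Below is a minimal NumPy/scikit-learn sketch of such a detector and its ROC/AUC evaluation; the error arrays and the 95th-percentile threshold are hypothetical placeholders, not the authors' code or settings.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical reconstruction errors ||G(z*) - x||^2 produced by Defense-GAN
# for a batch of clean test images and a batch of adversarial examples.
errors_clean = np.random.gamma(shape=2.0, scale=0.01, size=1000)  # placeholder values
errors_adv = np.random.gamma(shape=2.0, scale=0.05, size=1000)    # placeholder values

# Label 1 = adversarial, 0 = clean; the reconstruction error is the detection score.
scores = np.concatenate([errors_clean, errors_adv])
labels = np.concatenate([np.zeros_like(errors_clean), np.ones_like(errors_adv)])

auc = roc_auc_score(labels, scores)               # Area Under the ROC Curve
fpr, tpr, thresholds = roc_curve(labels, scores)  # full ROC curve

# A simple detector: flag an input as adversarial if its error exceeds a threshold
# chosen on clean data (here, a 5% false-positive budget).
threshold = np.percentile(errors_clean, 95)
is_adversarial = scores > threshold
print(f"AUC = {auc:.3f}, flagged {is_adversarial.mean():.1%} of all inputs")
```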
Methods
  • The authors assume three different attack threat levels:
  • 1. Black-box attacks: the attacker does not have access to the details of the classifier and defense strategy.
  • 2. White-box attacks: the attacker knows all the details of the classifier and defense strategy, and can compute gradients on the classifier and defense networks in order to find adversarial examples.
  • 3. White-box attacks, revisited: in addition to the details of the architectures and parameters of the classifier and defense, the attacker has access to the random seed and random number generator. In the case of Defense-GAN, this means that the attacker knows all the random initializations {z_0^(i)}, i = 1, ..., R (see the reconstruction sketch after this list).
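To make the role of the random initializations concrete, the sketch below implements the generic projection idea behind Defense-GAN: starting from R random latent vectors, run L gradient-descent steps on ||G(z) − x||² and keep the best reconstruction, which is then fed to the classifier. This is a minimal PyTorch sketch; the latent dimension, learning rate, and momentum are illustrative assumptions, and the authors' own implementation (in TensorFlow) may use different settings.

```python
import torch

def defense_gan_reconstruct(G, x, latent_dim=128, L=200, R=10, lr=0.05):
    """Project x onto the range of generator G by minimizing ||G(z) - x||^2.

    Runs R random restarts of L gradient-descent steps each and returns the
    reconstruction with the lowest error (hyper-parameters are assumptions).
    """
    best_rec, best_err = None, float("inf")
    for _ in range(R):
        z = torch.randn(1, latent_dim, requires_grad=True)  # one random start z_0^(i)
        opt = torch.optim.SGD([z], lr=lr, momentum=0.7)
        for _ in range(L):
            opt.zero_grad()
            loss = ((G(z) - x) ** 2).mean()
            loss.backward()
            opt.step()
        with torch.no_grad():
            rec = G(z)
            err = ((rec - x) ** 2).mean().item()
        if err < best_err:
            best_err, best_rec = err, rec
    return best_rec

# Usage: feed the reconstruction, not the raw input, to the classifier.
# logits = classifier(defense_gan_reconstruct(G, x_possibly_adversarial))
```

Larger L and R give reconstructions closer to the range of G at the cost of inference time, which is the trade-off quantified in Tables 8-11 and 13.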
Results
  • Results on black-box attacks: the authors present experimental results on FGSM black-box attacks.
  • As expected, the performance of Defense-GAN-Rec and that of Defense-GAN-Orig are very close (see the training sketch after this list).
  • They both perform consistently well across different classifier and substitute model combinations.
  • The classification performance of the adversarial training defense has very large variance across the different architectures.
  • It is worth noting that the adversarial training defense is only effective against FGSM attacks, because the adversarially augmented data, even with a different ε, is generated using the same method as the black-box attack (FGSM).
  • The GAN architecture is the same as that in Table 6, except for an additional ConvT(128, 5 × 5, 1) layer in the generator network.
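For reference, Defense-GAN-Rec and Defense-GAN-Orig differ only in the data the downstream classifier is trained on: Rec trains on GAN reconstructions of the training images, while Orig trains on the original images; at test time both classify the reconstruction of the (possibly adversarial) input. The PyTorch sketch below illustrates this split; the data loader, optimizer, and the reconstruct callable (e.g. a batched version of the reconstruction sketch above) are assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def train_defense_gan_classifier(classifier, optimizer, loader, reconstruct=None, epochs=1):
    """Train the downstream classifier for Defense-GAN (sketch).

    Defense-GAN-Orig: reconstruct=None      -> train on the original training images.
    Defense-GAN-Rec:  reconstruct=callable  -> train on GAN reconstructions of the images.
    """
    classifier.train()
    for _ in range(epochs):
        for x, y in loader:
            inputs = reconstruct(x) if reconstruct is not None else x
            optimizer.zero_grad()
            loss = F.cross_entropy(classifier(inputs), y)
            loss.backward()
            optimizer.step()

def defense_gan_predict(classifier, reconstruct, x):
    """At test time, both variants classify the reconstruction, never the raw input."""
    classifier.eval()
    rec = reconstruct(x)  # runs gradient descent over z internally, so no torch.no_grad() here
    with torch.no_grad():
        return classifier(rec).argmax(dim=1)
```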
Conclusion
  • The authors proposed Defense-GAN, a novel defense strategy utilizing GANs to enhance the robustness of classification models against black-box and white-box adversarial attacks.
  • It is worth mentioning that, although Defense-GAN was shown to be a feasible defense mechanism against adversarial attacks, one might come across practical difficulties while implementing and deploying this method.
  • The choice of hyper-parameters L and R is critical to the effectiveness of the defense, and it may be challenging to tune them without knowledge of the attack.
Tables
  • Table1: Classification accuracies of different classifier and substitute model combinations using various defense strategies on the MNIST dataset, under FGSM black-box attacks with ε = 0.3. Defense-GAN has L = 200 and R = 10
  • Table2: Classification accuracies of different classifier and substitute model combinations using various defense strategies on the F-MNIST dataset, under FGSM black-box attacks with ε = 0.3. Defense-GAN has L = 200 and R = 10
  • Table3: Classification accuracy of Model F using Defense-GAN (L = 400, R = 10), under FGSM black-box attacks for various noise norms ε and substitute Model E
  • Table4: Classification accuracies of different classifier models using various defense strategies on the MNIST (top) and F-MNIST (bottom) datasets, under FGSM, RAND+FGSM, and CW white-box attacks. Defense-GAN has L = 200 and R = 10
  • Table5: Neural network architectures used for classifiers and substitute models
  • Table6: Neural network architectures used for GANs
  • Table7: Neural network architecture used for the MagNet encoder
  • Table8: Classification accuracy of Model F using Defense-GAN with various numbers of iterations L (R = 10), on the MNIST dataset, under FGSM black-box attacks with ε = 0.3
  • Table9: Classification accuracy of Model F using Defense-GAN with various numbers of iterations L (R = 10), on the F-MNIST dataset, under FGSM black-box attacks with ε = 0.3
  • Table10: Classification accuracy of Model F using Defense-GAN with various numbers of random restarts R (L = 100), on the MNIST dataset, under FGSM black-box attacks with ε = 0.3
  • Table11: Classification accuracy of Model F using Defense-GAN with various numbers of random restarts R (L = 100), on the F-MNIST dataset, under FGSM black-box attacks with ε = 0.3
  • Table12: Classification accuracies of different classifier models using various defense strategies on the CelebA gender classification task, under FGSM, RAND+FGSM, and CW white-box attacks. Defense-GAN has L = 200 and R = 2
  • Table13: Average time, in seconds, to compute reconstructions of MNIST/F-MNIST images for various values of L and R
Related work
  • RELATED WORK AND BACKGROUND INFORMATION

    In this work, we propose to use GANs for the purpose of defending against adversarial attacks in classification problems. Before detailing our approach in the next section, we explain related work in three parts. First, we discuss different attack models employed in the literature. We then go over related defense mechanisms against these attacks and discuss their strengths and shortcomings. Lastly, we explain necessary background information regarding GANs.

    2.1 ATTACK MODELS AND ALGORITHMS

    Various attack models and algorithms have been used to target classifiers. All attack models we consider aim to find a perturbation δ to be added to a (legitimate) input x ∈ R^n, resulting in the adversarial example x̂ = x + δ. The ℓ∞-norm of the perturbation is denoted by ε (Goodfellow et al., 2015) and is chosen to be small enough so as to remain undetectable. We consider two threat levels: black- and white-box attacks.
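To make the notation concrete, the Fast Gradient Sign Method (FGSM) of Goodfellow et al. (2015), used throughout the experiments above, instantiates δ as ε times the sign of the loss gradient with respect to the input, so that the ℓ∞ constraint is met with equality. A minimal PyTorch sketch follows; the [0, 1] pixel range and batch shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(classifier, x, y, eps=0.3):
    """FGSM: delta = eps * sign(grad_x loss), so ||delta||_inf = eps (sketch).

    x is assumed to be a batch of images scaled to [0, 1]; y holds the true labels.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_hat = x + eps * grad.sign()          # adversarial example x_hat = x + delta
    return x_hat.clamp(0.0, 1.0).detach()  # keep pixels in the valid range
```

In the black-box experiments the gradient is taken on a substitute model rather than the target classifier, and RAND+FGSM (Tramer et al., 2017) prepends a small random step before applying FGSM with the remaining budget.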
Funding
  • This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012.
References
  • Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
  • Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
  • Dan Hendrycks and Kevin Gimpel. Early methods for detecting adversarial images. In International Conference on Learning Representations, Workshop Track, 2017.
  • Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 2014.
  • Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative models. In AAAI Conference on Artificial Intelligence, 2018.
  • Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations, Workshop Track, 2017.
  • Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations, 2017.
  • Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, 2015.
  • Dongyu Meng and Hao Chen. MagNet: a two-pronged defense against adversarial examples. arXiv preprint arXiv:1705.09064, 2017.
  • Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Nicolas Papernot, Ian Goodfellow, Ryan Sheatsley, Reuben Feinman, and Patrick McDaniel. cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016a.
  • Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE Symposium on Security and Privacy, 2016b.
  • Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016c.
  • Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, 2016d.
  • Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 2017.
  • Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, Workshop Track, 2014.
  • Florian Tramer, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
  • Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.