As machine learning (ML) becomes more prevalent, bad actors seek to take advantage of our confidence in models’ predictions. An example of this can be found in autonomous vehicle safety, wherein the application of a small sticker to a stop sign causes the vehicle’s ML models to incorrectly classify and interpret the visual cues on the sign [1]. In an exercise to better understand common vulnerabilities in model deployment, we experimented with a handful of prevalent adversarial attacks to assess their viability and demonstrate their potential.
Figure 1: A human can easily discern this as a stop sign with stickers attached, but a trained model misidentified it as a speed limit sign [1].
In addition to the white-box and black-box adversarial attack categories introduced in Part 1 of this series, other attributes define the adversarial attack environment.
Attacks can be grouped into targeted and non-targeted attacks; the short code sketch after the lists below contrasts the two objectives.
- Targeted adversarial attacks aim to force a misclassification to a specific output class, e.g. causing a model to misclassify a handwritten “7” as a “4”.
- Non-targeted adversarial attacks aim to cause a misclassification without specifying an output class, e.g. causing a handwritten “7” to be misclassified as any other digit.
Attacks in the form of image perturbations can be image-specific or universal.
- Image-specific perturbations are crafted for a single image to cause a misclassification by a specific model.
- A universal perturbation can be applied to any image to cause a misclassification by a specific model.
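In code, the targeted versus non-targeted distinction comes down to the objective the attacker optimizes. The snippet below is a minimal, hedged illustration of that difference rather than a piece of any specific attack in this post; `logits`, `true_label`, and `target_label` are placeholder names.

```python
# Hedged illustration of the two attack objectives; `logits` is assumed to be
# a model's raw output for a batch of images, and the labels are index tensors.
import torch.nn.functional as F

def non_targeted_objective(logits, true_label):
    # A non-targeted attacker tries to *increase* this loss: any class other
    # than the true one counts as a success.
    return F.cross_entropy(logits, true_label)

def targeted_objective(logits, target_label):
    # A targeted attacker tries to *decrease* this loss: only the chosen
    # target class counts as a success.
    return F.cross_entropy(logits, target_label)
```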
Given the above framework, we chose to implement our experimental attacks on image classification models using the MNIST and CIFAR-10 datasets.
One-Pixel Attack
We first implemented the “one pixel attack” [2], a black-box, non-targeted, image-specific attack that modifies a single pixel in an image to cause a misclassification. The attack’s original authors showed that it works 70% of the time on three different models, with an average confidence of 97%. Our implementation, based on [3], used a basic convolutional neural network (CNN) written in PyTorch. The attack has three steps (a minimal code sketch follows the list):
- Generate hundreds of candidate vectors, each containing the xy-coordinates and RGB values of one pixel in the target image.
- Randomly perturb the candidate vectors and, over many iterations, compare each candidate’s influence against the others, keeping the strongest (a differential evolution search).
- Measure a candidate’s influence by the class probabilities the model outputs. If the model assigns a high probability to an incorrect class, the attack was successful.
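The sketch below is a minimal, illustrative version of this procedure using SciPy’s differential evolution (the search strategy used in [2]); `model`, `image`, and the hyperparameters are assumptions rather than the exact settings from [3].

```python
# Minimal one-pixel attack sketch via differential evolution. Assumes `model`
# is a trained PyTorch classifier over 32x32 RGB images with pixel values in
# [0, 1] and `image` is a (3, 32, 32) tensor; names and settings are illustrative.
import torch
import torch.nn.functional as F
from scipy.optimize import differential_evolution

def apply_pixel(image, candidate):
    """Apply one candidate vector (x, y, r, g, b) to a copy of the image."""
    x, y, r, g, b = candidate
    adv = image.clone()
    adv[:, int(y), int(x)] = torch.tensor([r, g, b], dtype=adv.dtype)
    return adv

def one_pixel_attack(model, image, true_label, iters=75):
    model.eval()

    def confidence_in_truth(candidate):
        # Differential evolution minimizes this: the probability the model
        # still assigns to the correct class after the single-pixel change.
        adv = apply_pixel(image, candidate).unsqueeze(0)
        with torch.no_grad():
            probs = F.softmax(model(adv), dim=1)
        return probs[0, true_label].item()

    bounds = [(0, 31), (0, 31), (0, 1), (0, 1), (0, 1)]  # x, y, R, G, B
    result = differential_evolution(confidence_in_truth, bounds,
                                    maxiter=iters, popsize=40, seed=0)
    return apply_pixel(image, result.x)
```

A run is successful when the returned image’s top predicted class no longer matches `true_label`.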
Two drawbacks to this method limit its value in practice. First, the pixel is usually obviously perturbed, as in the image of the cat below (misclassified as a frog). Second, the attack traditionally works best on lower-resolution images.
Figure 2: Image with one pixel modified, causing cat to be misclassified as a frog.
Fast Gradient Sign Method (FGSM)
We then implemented the fast gradient sign method (FGSM) [4], a more sophisticated white-box attack. This image-specific method, which has both targeted and non-targeted variants, exploits the observation that in the high-dimensional space of input images, the network’s cost function is nearly linear around a given example. Instead of searching iteratively, FGSM computes the gradient of the cost function with respect to the image and takes a single step, bounded by the L_infinity norm, in the direction of the gradient’s sign, which maximizes the linearized cost. Because no iteration is required, the attack is faster than the one-pixel attack. Further, the perturbed image often looks unchanged to the human eye.
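The snippet below is a minimal sketch of the basic, non-targeted FGSM step in PyTorch rather than our exact implementation; `model`, `image`, and `label` are placeholders, and `epsilon` is an assumed perturbation budget. A targeted variant would instead step in the direction that decreases the loss for a chosen target class.

```python
# Minimal FGSM sketch: one signed-gradient step bounded by epsilon in the
# L-infinity norm. `model`, `image` (a (C, H, W) tensor with values in [0, 1]),
# and `label` (a 0-dim class index tensor) are placeholders.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=0.1):
    model.eval()
    image = image.clone().detach().requires_grad_(True)

    # Cost of the correct label for this image.
    loss = F.cross_entropy(model(image.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()

    # Step in the direction that increases the cost; each pixel moves by at
    # most epsilon, so the perturbation's L-infinity norm is epsilon.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```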
To get a better understanding of the mechanism behind this attack, we implemented attention maps (excitation backpropagation) [5] to highlight the features that contributed most to the classification; brighter colors indicate stronger contributions. This is shown in the figure below, which presents the original image alongside the noise added to cause a misclassification.
Figure 3: MNIST images attacked with FGSM; generated attention maps from the model.
The attention map for the unaltered image shows that the network focused on features that essentially outline the digit. Once the attack is applied, the highlighted features are greatly perturbed, illustrating the effectiveness of this method. However, while the attacked image still looks like a “7” to the human eye, those familiar with the MNIST dataset might be able to tell that an attack was performed.
The figure below shows a more complex example in which the application of an attack is imperceptible to the human eye.
Figure 4: Image from the CIFAR-10 dataset after being attacked with FGSM.
The above attention maps show that, for the original image, the model relied heavily on the edges of the goldfish for its classification. The perturbation essentially dampened the effect of all of the fish’s edge features, to the point where the sum of the remaining features carried enough weight to tip the classification. The subtlety of this attack is what makes it so effective.
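For readers who want to reproduce a similar visualization: excitation backprop [5] requires model-specific hooks, so the sketch below substitutes a simpler vanilla gradient saliency map that conveys the same idea of highlighting the pixels with the most influence on the predicted class. It is not the method behind the figures above, and all names are placeholders.

```python
# Simple gradient saliency map (a stand-in for excitation backprop [5]):
# brighter values mark pixels whose changes most affect the class score.
import torch

def saliency_map(model, image, target_class):
    model.eval()
    image = image.clone().detach().requires_grad_(True)

    # Score of the class to explain, then its gradient w.r.t. the pixels.
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()

    # Collapse the channel dimension with a max over absolute gradient values.
    return image.grad.abs().max(dim=0).values
```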
UPSET Algorithm as a Black-Box Attack
Adversarial attacks can also be effective in the black-box scenario, as demonstrated by the UPSET algorithm [6]. This attack is one of the few black-box attacks that use universal perturbations. It essentially solves an optimization problem over pixel values until a single perturbation is found for each target class. That means that if the targeted model predicts over a fixed set of classes, each class gets its own fixed perturbation that can be added to any image with a high probability of changing the model’s output to that class. For this to work in practice, an attacker needs to be able to submit a large volume of test images to the target model.
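The learning step of UPSET is beyond the scope of this post, but the sketch below shows how such a perturbation would be applied once learned; `perturbations` is an assumed mapping from target class to an image-sized perturbation, and pixel values are assumed to lie in [0, 1].

```python
# Applying an UPSET-style universal perturbation (assumes the per-class
# perturbations have already been learned; that training step is not shown).
import torch

def apply_universal(image, perturbations, target_class, scale=1.0):
    # The same fixed perturbation works on *any* image for this target class.
    adversarial = image + scale * perturbations[target_class]
    # Clip back to the valid pixel range.
    return adversarial.clamp(0, 1)
```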
Up Next
We were able to easily reproduce these adversarial attacks, demonstrating only a fraction of the exploitable vulnerabilities inherent in ML models. Knowing when and how to defend models is as important as selecting training data. If practitioners are aware of the conditions under which their models are vulnerable, they can build appropriate plans for defense into model deployment. In our next post, we discuss state-of-the-art methods for defending against attacks.
This research was performed under the Novetta Machine Learning Center of Excellence.
Sources
[1] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song, “Robust Physical-World Attacks on Deep Learning Visual Classification”, arXiv:1707.08945, 2018.
[2] Jiawei Su, Danilo Vasconcellos Vargas, and Sakurai Kouichi, “One pixel attack for fooling deep neural networks”, arXiv:1710.08864, 2019.
[3] jiayunhan, “Implementation of adversarial attacks and defences”, github.com/jiayunhan/adversarial-pytorch.
[4] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio, “Adversarial examples in the physical world”, arXiv:1607.02533, 2016.
[5] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff, “Top-down Neural Attention by Excitation Backprop”, arXiv, 2016.
[6] S. Sarkar, A. Bansal, U. Mahbub, and R. Chellappa, “UPSET and ANGRI: Breaking High Performance Image Classifiers”, arXiv:1707.01159, 2017.