Introduction to Adversarial Attack Defense Methods
In our previous blog posts, we introduced and experimented with adversarial attacks against deep learning models. Here, we explore three defense methods that mitigate or reduce the effectiveness of some of these attacks:
1. Modifying the feature space of the neural network itself
2. Modifying an input to a network before it is used for inference
3. Modifying the training data to account for adversarial inputs
Modifying the Feature Space of a Neural Network
In our first experiment, we implemented a technique [1] that modifies the feature space learned by a trained neural network. The technique separates the features of each class into more distinct clusters, forcing each class to occupy its own region of the high-dimensional feature space and effectively hindering both black-box and white-box attacks.
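The exact loss formulation in [1] has several components, but the core idea can be illustrated with an auxiliary "clustering" term added to the usual cross-entropy loss: pull each sample's penultimate-layer features toward a learned center for its class, and push different class centers apart. The sketch below is a simplified stand-in for the paper's method, not its implementation; the class name `ClusteringLoss`, the margin, and the weighting `lam` are our own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusteringLoss(nn.Module):
    """Illustrative auxiliary loss: pull penultimate-layer features toward a
    learned center for their class, and push different class centers apart.
    A simplified stand-in for the loss in [1], not its exact formulation."""

    def __init__(self, num_classes, feat_dim, margin=10.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin = margin

    def forward(self, features, labels):
        # Attraction: squared distance of each feature to its own class center.
        attract = ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

        # Repulsion: encourage pairwise center distances to exceed the margin.
        dists = torch.cdist(self.centers, self.centers, p=2)
        mask = ~torch.eye(self.centers.size(0), dtype=torch.bool, device=dists.device)
        repel = F.relu(self.margin - dists[mask]).mean()

        return attract + repel

# Training would combine the usual cross-entropy with the clustering term,
# assuming the model returns both logits and penultimate-layer features:
#   logits, features = model(x)
#   loss = F.cross_entropy(logits, y) + lam * clustering_loss(features, y)
```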
We tested the technique in [1] to see how well this method performed on the MNIST dataset. Our test was based on the projected gradient descent (PGD) attack without adversarial training. PGD is a more robust, iterative version of the FGSM attack discussed in our previous post. When the PGD attack is used on the MNIST data, classification accuracy falls to 11.9% [1]. Once the method discussed here is implemented, classification accuracy under attack jumps to 69.5%, compared to 99.53% with no attack at all.
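For reference, PGD perturbs an input by repeatedly taking small FGSM-style steps and projecting the result back into an epsilon-ball around the original image. A minimal L-infinity PGD sketch in PyTorch might look like the following; the epsilon, step size, and iteration count are illustrative choices, not the settings used in [1].

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """L-infinity PGD: iterated FGSM steps, each projected back into the
    eps-ball around the original input and clipped to the valid pixel range."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]

        # Ascend the loss along the gradient sign, then project onto the eps-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)

    return x_adv.detach()
```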
We can visually explore this defense using the implementation in [2]. These images show the 2D activations of the second-to-last layer of the network. The image on the left shows the feature space when the network is faced with a PGD attack; the increased separation between class regions produced by the defense can be seen on the right.
The clusters for each class are more clearly separated after the adversarial defense method is implemented. This makes it much more difficult for an adversarial input to induce misclassification.
Modifying an Input to a Network – Defense-GAN as a Defense Method
The next defense method we explored was Defense-GAN [3], a generative adversarial network (GAN) trained to generate a distribution of images based on “clean” images not previously subject to an attack. Given an image to be tested for inference, the network finds a similar image created by the generator [4]. The model then runs this generated image through the classifier.
During training, the generator produces images that get sent to the discriminator to determine if the image is real or simulated. Feedback is then sent to the generator to improve simulated image results. During testing, a real image is mapped to the most similar generated image, which is the input into the model classifier.
Figure 2: Defense-GAN process.
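Mapping a test image to "the most similar generated image" is itself a small optimization problem: Defense-GAN searches the generator's latent space for a code z whose output G(z) best reconstructs the input, typically with several gradient-descent steps from multiple random starts. Below is a rough sketch of that projection step; the function name, restart and step counts, and learning rate are our own illustrative choices rather than the paper's exact settings.

```python
import torch

def defense_gan_reconstruct(generator, x, num_restarts=10, num_steps=200, lr=0.05, latent_dim=100):
    """Approximate the Defense-GAN projection: find z minimizing ||G(z) - x||^2
    over several random restarts and return the best reconstruction G(z*),
    which is then fed to the classifier in place of the raw input."""
    x = x.detach()
    best_recon, best_err = None, float("inf")

    for _ in range(num_restarts):
        z = torch.randn(x.size(0), latent_dim, device=x.device, requires_grad=True)
        opt = torch.optim.SGD([z], lr=lr)

        for _ in range(num_steps):
            opt.zero_grad()
            loss = ((generator(z) - x) ** 2).flatten(1).sum(dim=1).mean()
            loss.backward()
            opt.step()

        with torch.no_grad():  # keep the restart with the lowest reconstruction error
            err = ((generator(z) - x) ** 2).sum().item()
            if err < best_err:
                best_err, best_recon = err, generator(z)

    return best_recon

# Usage: logits = classifier(defense_gan_reconstruct(generator, x_test))
```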
The drawback of this method is that GANs are notoriously hard to train, so transferring the approach to arbitrary models and datasets can be difficult. However, in cases where the GAN can be trained, the method performs very well.
On MNIST data, Defense-GAN performed well against the FGSM attack, recovering much of the accuracy lost to the attack.
| Image | Classification Accuracy |
| --- | --- |
| Native | 99% |
| After FGSM attack | 63% |
| FGSM attack countered by Defense-GAN | 93% |
The model demonstrates robustness to attack, and because no specific attack method was assumed when training the GAN, the defense can be readily applied against any attack.
Modifying the Training Data – Adversarial Training as a Supplemental Feature
The above methods can be improved by adding in adversarial training – taking inputs that have already been attacked and including them as part of the original training set. Adversarial training on its own is not sufficient, as it is impossible to anticipate every type of attack that could be performed on the images. This method can be seen as a complementary enhancement to other types of defenses.
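In its simplest form, adversarial training just folds attacked copies of each training batch into the loss. The sketch below reuses the illustrative `pgd_attack` helper from earlier; the equal weighting of the clean and adversarial terms is an arbitrary choice for illustration.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y):
    """One training step on a half-clean, half-adversarial objective."""
    model.eval()                      # craft attacks against the current weights
    x_adv = pgd_attack(model, x, y)   # illustrative helper sketched earlier
    model.train()

    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```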
Conclusion
Defenses against adversarial attacks will be a constantly evolving research area, as adversaries continue to develop new ways to fool defenses. The methods discussed here are leading the way in making neural networks robust to attack so that they can be relied upon in critical applications.
This research was performed under the Novetta Machine Learning Center of Excellence.
Sources
[1] Mustafa, Aamir and Khan, Salman and Hayat, Munawar and Goecke, Roland and Shen, Jianbing and Shao, Ling, “Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks”, arXiv preprint arXiv:1904.00887, 2019.
[2] Mustafa, Aamir, “Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks”, github.com/aamir-mustafa/pcl-adversarial-defense.
[3] Samangouei, Pouya and Kabkab, Maya and Chellappa, Rama, “Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models”, International Conference on Learning Representations, 2018.
[4] Because GANs are difficult to train in practice, Wasserstein GANs (WGANs) were introduced; they use a modified loss function that makes training better behaved. The Defense-GAN discussed here uses a WGAN as its network architecture.