Artificial intelligence (AI) and machine learning (ML) models have become ingrained in everyday life, with applications ranging from biometric verification to virtual assistants. Reliance on ML becomes problematic when models misclassify an observation, potentially leading to adverse consequences. Thus the performance, robustness, and security of these models are critical to their success.
Modern deep learning models have surpassed human performance on some specific tasks (e.g., certain computer vision problems), yet they can be highly vulnerable to adversarial attacks. Such attacks are often crafted using methods similar to those employed by the models that perform the intended classifications. For example, an input can be modified or perturbed in a way that is undetectable to the human senses and that an ML classifier does not flag as abnormal, but that causes an erroneous, high-confidence classification.

At left, an unperturbed image correctly classified as a goldfish by an ML model. At right, an image subjected to an imperceptible Fast Gradient Sign Method (FGSM) attack and misclassified as an anemone by an ML model [1].
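To give a concrete sense of how an FGSM-style perturbation is computed, the sketch below (written in PyTorch purely for illustration) takes the gradient of the classification loss with respect to the input pixels and nudges every pixel a small step, epsilon, in the direction that increases the loss. The function and parameter names (fgsm_attack, model, epsilon) are our own placeholders, not code from the work cited above.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    """Craft an FGSM adversarial example for a single input.

    `model` is assumed to be a pretrained classifier that returns logits,
    `image` a tensor of shape (1, C, H, W) with pixel values in [0, 1],
    and `label` a tensor of shape (1,) holding the true class index.
    """
    image = image.clone().detach().requires_grad_(True)

    # Forward pass and loss with respect to the true label
    logits = model(image)
    loss = F.cross_entropy(logits, label)

    # Gradient of the loss with respect to the input pixels
    loss.backward()

    # Step each pixel by epsilon in the direction that increases the loss,
    # then clamp back to a valid pixel range
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

Because the perturbation is bounded by epsilon per pixel, the altered image can remain visually indistinguishable from the original while still flipping the model's prediction.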
White Box vs. Black Box Attacks
Adversarial attacks can be categorized as white-box or black-box attacks. White-box attacks represent scenarios in which an adversary has complete knowledge of the model in question, including its architecture and training method. With this knowledge, an adversary can make near-imperceptible changes to images that alter the model's predictions, ultimately fooling the model by exploiting its blind spots. Black-box attacks are those in which the adversary has no knowledge of the inner workings of the classification model. The attacker can only observe the model outputs and iteratively change the input to eventually yield an incorrect classification.
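To make the black-box constraint concrete, the deliberately simple sketch below assumes only a query_model function that returns a predicted class index; it proposes small random perturbations and keeps querying until the prediction changes. Real black-box attacks are far more query-efficient, so this is an illustration of the setting rather than a practical method, and every name in it is hypothetical.

```python
import numpy as np

def black_box_attack(query_model, image, true_label,
                     step=0.05, max_queries=1000, rng=None):
    """Toy black-box attack: only the model's predicted label is observed.

    `query_model` is assumed to be a function that takes an image array and
    returns a predicted class index; no gradients or internals are used.
    """
    rng = rng or np.random.default_rng(0)
    candidate = image.copy()

    for _ in range(max_queries):
        # The attacker's only signal is the model's output label
        if query_model(candidate) != true_label:
            return candidate  # the model has been fooled

        # Propose a small random perturbation, keeping pixels in [0, 1]
        noise = rng.normal(scale=step, size=image.shape)
        candidate = np.clip(candidate + noise, 0.0, 1.0)

    return None  # attack failed within the query budget
```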
Adversarial Attacks and Defense Background
Attacks that create adversarial images can undermine the effectiveness of biometric identification systems. For example, techniques have been developed that can fool gender classification models while keeping face matching capabilities unimpaired [2]. Attacks have also been shown to be effective against automated speech recognition models, wherein an attacker embeds malicious audio commands that sound harmless to the human ear but trigger an action from a speech-activated technology [3].
As adversarial attacks develop and improve, defensive work is ongoing to combat their effectiveness. One strategy uses adversarial examples to supplement the training set during initial model training so that the model is less likely to be fooled. Another line of research modifies the original neural networks to help identify some of the features that give them blind spots. Lastly, "network add-on" approaches attempt to remove perturbations from the input before the model attempts to classify it.
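For the first of these defenses, adversarial training, the hedged sketch below shows one way a training step could mix clean and FGSM-perturbed copies of a minibatch. It is a minimal illustration under our own assumptions (placeholder names such as adversarial_training_step, model, and optimizer), not a reference implementation from the literature.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, epsilon=0.01):
    """One training step that mixes clean and FGSM-perturbed examples.

    `model`, `optimizer`, `images`, and `labels` stand in for an ordinary
    classifier, its optimizer, and one minibatch of labeled training data.
    """
    # Build adversarial copies of the batch with a single FGSM step
    images = images.clone().detach()
    images.requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv_images = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()

    # Train on clean and adversarial examples together so the model
    # learns to classify both correctly
    optimizer.zero_grad()
    combined = torch.cat([images.detach(), adv_images])
    targets = torch.cat([labels, labels])
    train_loss = F.cross_entropy(model(combined), targets)
    train_loss.backward()
    optimizer.step()
    return train_loss.item()
```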
Up Next
Here we have briefly introduced how AI models are vulnerable to adversarial attacks and some strategies being developed to combat them. In our next post, we will dive deeper into specific types of adversarial attacks and share research into how models actually see these adversarial images. The third and final installment of the series will explore defensive techniques used to combat adversarial attacks and discuss our research into how effective these methods are.
This research was performed under the Novetta Machine Learning Center of Excellence.