  • Nazar Kaminskyi

Protection against adversarial examples in image classification (part 1)

Nowadays, you will hardly find anyone who has not heard of artificial intelligence. Machine learning, and deep learning in particular, has achieved incredible results on problems that used to be too difficult for computers. The variety of applications of neural networks, one specific AI technology, is impressive: image recognition systems, autonomous cars, smart robots, voice assistants, forecasting, fraud detection, machine translation, natural language processing, music and text generation, recommendation systems, bioinformatics... and this list is far from exhaustive.

Some of these systems are critical components of larger solutions, and their failure can lead to undesirable or even tragic consequences.

The great popularity of AI inevitably raises new problems. One such problem of neural networks was described by Szegedy et al. [1] in 2014. The authors found that several machine learning models, including state-of-the-art neural networks, are vulnerable to carefully crafted noise. Since that article, researchers have been actively studying this vulnerability of neural networks, known as adversarial examples.

In part 1 of this blog post, we explain this vulnerability and review existing methods of protection against it. In part 2, to be released soon, we will describe in detail our experiments and results in protecting image classification against adversarial examples.

Adversarial Examples

In general, adversarial examples can be generated in all the areas listed in the introduction; more information can be found in the following blog as well as in articles [2, 3]. To keep things visual, we will focus on the problem of image classification. As mentioned above, Szegedy et al. found that a small amount of correctly calculated noise added to the input image can easily fool a state-of-the-art neural network. The following image illustrates this phenomenon.

It is easy to reach high confidence in the incorrect classification of an adversarial example, as shown in "Explaining and Harnessing Adversarial Examples" (Goodfellow et al., ICLR 2015).

The main danger is that the two images look identical to the human eye, yet the network misclassifies the second one.

Therefore, the main idea of this phenomenon can be expressed by the following formula:
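One common way to write it, given here as an assumption about notation rather than as the exact formula intended above: let f be the classifier, x a clean image with true label y, η the added noise, ε a small budget, and J the training loss with parameters θ. The last line is the fast gradient sign choice of η from Goodfellow et al.

```latex
\begin{aligned}
x_{adv} &= x + \eta, \qquad \|\eta\|_{\infty} \le \epsilon, \\
f(x_{adv}) &\ne f(x), \\
\eta &= \epsilon \cdot \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)
\end{aligned}
```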

We will not go into details here: you can find a great explanation of how the simplest adversarial examples are generated in the following blog, and advanced readers can find all the information in articles [2] and [3].
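For readers who want something concrete before following those links, here is a minimal sketch of the Fast Gradient Sign Method in NumPy. It is an illustration under stated assumptions, not the setup used in this post: the model is a plain logistic regression, and the weights w, the toy input x and the budget eps are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One signed-gradient step that increases the cross-entropy loss."""
    p = sigmoid(w @ x + b)            # predicted probability of class 1
    grad_x = (p - y) * w              # d(loss)/d(x) for logistic regression
    return x + eps * np.sign(grad_x)  # every coordinate moves by at most eps

rng = np.random.default_rng(0)
w = rng.normal(size=100)              # hypothetical trained weights
b = 0.0
x = 0.05 * np.sign(w)                 # toy "image" that the model assigns to class 1
y = 1.0                               # its true label

x_adv = fgsm(x, y, w, b, eps=0.1)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
print(f"P(class 1) clean: {p_clean:.3f}  adversarial: {p_adv:.3f}")
print("largest coordinate change:", np.max(np.abs(x_adv - x)))
```

No coordinate changes by more than eps, but all 100 small changes are aligned with the weights, so their effect on the logit adds up and the prediction flips. This is the linear explanation of adversarial examples given by Goodfellow et al.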

Let us review the basic concepts of adversarial examples in accordance with [3]:

1. Adversarial Falsification (Error type):

a. False positive attacks (Type I Error) - for image classification, such an example looks like noise to a human, yet the network classifies it as a certain object with high confidence;

b. False negative attacks (Type II Error) - for image classification, this example can be easily identified by a human, however, the deep neural network cannot properly classify it.

2. Adversary’s Knowledge:

a. white box - the attack requires information about the internal structure (architecture) of the classifier: hyperparameters, activation functions, hidden layers, weights, etc.;

b. black box - the attack does not require information about the internal structure (architecture) of the classifier; only the input data and the output of the classifier are available;

c. semi-white (gray) box - an attacker is assumed to know the architecture of the target model, but to have no access to the weights in the network.

3. Adversarial Specificity:

a. Targeted attacks misguide deep neural networks to a specific class;

b. Non-targeted attacks do not aim at a specific class; any output other than the correct class counts as a success.

4. Attack Frequency:

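a. One-time attacks compute the adversarial example in a single step;

b. Iterative attacks update the adversarial example over multiple steps.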

5. Perturbation Scope:

a. Individual attacks generate different perturbations for each clean input;

b. Universal attacks generate a single perturbation for the whole dataset, which can then be applied to any clean input.

6. Perturbation Limitation:

a. Optimized Perturbation sets perturbation as the goal of the optimization problem;

b. Constraint Perturbation sets perturbation as the constraint of the optimization problem.
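To make the distinctions in items 3 and 4 concrete, here is a minimal NumPy sketch. It assumes a hypothetical linear softmax classifier with weights W (not a model from this post) and signed-gradient steps in the style of FGSM and BIM. The function names, the toy data and the budget eps are illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(W, x, label):
    """Gradient of the cross-entropy loss for `label` w.r.t. x (logits = W @ x)."""
    p = softmax(W @ x)
    p[label] -= 1.0
    return W.T @ p

def attack(W, x, true_label, eps, steps=1, target=None):
    """steps=1 is a one-time attack, steps>1 an iterative one.
    target=None is non-targeted, otherwise the attack aims at class `target`."""
    step = eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        if target is None:
            g = loss_grad(W, x_adv, true_label)   # move away from the true class
        else:
            g = -loss_grad(W, x_adv, target)      # move towards the target class
        # keep the total perturbation inside the eps budget around x
        x_adv = np.clip(x_adv + step * np.sign(g), x - eps, x + eps)
    return x_adv

# Toy setup: 3 classes, each class "looks at" its own block of 10 pixels.
W = np.kron(np.eye(3), np.ones((1, 10)))
x = np.full(30, 0.5)
x[:10] += 0.05                                    # slightly brighter block 0 -> class 0

def predict(v):
    return int(np.argmax(W @ v))

x_untargeted = attack(W, x, true_label=0, eps=0.05)
x_targeted = attack(W, x, true_label=0, eps=0.05, target=2)
x_iterative = attack(W, x, true_label=0, eps=0.05, steps=5, target=2)
print(predict(x), predict(x_untargeted), predict(x_targeted), predict(x_iterative))
```

The only difference between the targeted and the non-targeted variant is which loss the gradient is taken from and in which direction the step goes. The iterative variant spends the same budget in several smaller steps and recomputes the gradient each time, which matters for non-linear models even though it makes no difference in this linear toy case.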

Today, there are dozens of ways to generate this special noise, i.e. to generate adversarial examples. Among them: Fast Gradient Sign Method (FGSM), Iterative Gradient Sign Method (IGSM), Jacobian Saliency Map Attack (JSMA), DeepFool (DF), One-Step Target Class Method (OSTCM), Basic Iterative Method (BIM), Iterative Least-Likely Class Method (ILLC), Compositional Pattern-Producing Network-Encoded Evolutionary Algorithm (CPPN EA), Carlini and Wagner’s Attack (C&W), Universal Perturbation (UP), Feature Adversary (FA), Hot/Cold method (H/C), Model-based Ensembling Attack (MEA), Ground-Truth Attack (GTA), Targeted Audio Adversarial Examples (TAAE), Black-box Zeroth Order Optimization (ZOO), One Pixel Attack (OPA), Natural GAN (NGAN), Zero-Query Attacks (ZQA), Natural Evolution Strategies (NES), Boundary Attack (BA), Greedy Search Algorithm (GSA), Genetic Attack (GA), Improved Genetic Algorithm (IGA), Probability Weighted Word Saliency (PWWS), Replacement Insertion & Removal of Words (RI&RoW), Real-World Noise (RWN), Genetic Algorithms and Gradient Estimation (GA&GE).

A good toolbox for generating adversarial examples is described here.

Defense methods

The discovery of adversarial examples has helped researchers better understand the nature of neural networks. Scientists are working to explain the behavior of neural networks better and to find different ways of dealing with their vulnerabilities.

Here we consider the main types of defense. A detailed analysis can be found in [2, 3].

There are two main defense strategies [2]: reactive, which detects adversarial examples after the deep neural network has been built, and proactive, which makes the deep neural network more robust before attackers generate adversarial examples.

Reactive defense:

Adversarial Detecting

■ Most often, the detector is a small, simple neural network that acts as a binary classifier: it labels each input as either clean or adversarial.

Input Reconstruction

■ The basic idea is that adversarial examples can be transformed into clean data via reconstruction. After transformation, the adversarial examples will not affect the prediction of deep learning models.
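As a small illustration of the reconstruction idea, the sketch below uses bit-depth reduction, a simple input transformation known from the feature squeezing literature. This is a swapped-in example technique, since the post does not say which reconstruction method it has in mind. The function name and the toy pixel values are hypothetical.

```python
import numpy as np

def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

# Toy "image" whose clean pixels already sit on the 3-bit grid.
clean = np.array([0.0, 2 / 7, 4 / 7, 1.0])
noise = np.array([0.02, -0.03, 0.01, -0.02])      # small adversarial-style noise
adversarial = np.clip(clean + noise, 0.0, 1.0)

restored = reduce_bit_depth(adversarial, bits=3)
print("restored equals clean:", np.allclose(restored, clean))
```

Quantization only removes noise that is smaller than half a quantization step, and it also discards detail from clean images, so in practice the number of bits is a trade-off between robustness and accuracy.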

Network Verification

■ This method checks whether the input data violates the properties of the neural network that were determined during the training phase. Verifying properties of deep neural networks is a promising solution to tackle adversarial examples, because it may detect new unseen attacks.

Proactive defense:

Network Distillation

■ The main idea is to train the model twice: first with the one-hot ground-truth labels, and then with the first model's predicted probabilities as the training targets, which enhances robustness. This hides the gradient information of the model and confuses adversaries, since most attack algorithms rely on the classifier's gradient information.
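Defensive distillation is usually described with a softmax "temperature" that softens the first model's probabilities before they are used as targets for the second training run. The sketch below shows only that softening step. The logits and the temperature value are hypothetical, and the two training runs themselves are omitted.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; a larger T gives a flatter distribution."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())           # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])    # hypothetical output of the first model

hard = softmax_with_temperature(logits, T=1.0)    # close to a one-hot vector
soft = softmax_with_temperature(logits, T=20.0)   # softened training targets
print("T=1 :", np.round(hard, 3))
print("T=20:", np.round(soft, 3))
```

The softened targets keep the same ranking of classes but carry information about how similar the classes look to the first model. Training the second model on them tends to produce smaller gradients with respect to the input, which is what gradient-based attacks depend on.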

Adversarial (Re)training

■ Adversarial training is an intuitive defense method against adversarial samples, which attempts to improve the robustness of a neural network by training it with adversarial samples.
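A minimal sketch of such a training loop is shown below. It is an illustration under stated assumptions rather than the procedure used in this post: the model is a logistic regression trained by gradient descent, adversarial samples are produced with one FGSM step against the current weights, and the data, eps, learning rate and number of epochs are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(X, y, w, b, eps):
    """FGSM against the current model, applied to every row of X."""
    p = sigmoid(X @ w + b)
    grad_X = (p - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_X)

def train(X, y, eps=0.0, epochs=200, lr=0.1):
    """Gradient descent; with eps > 0 each epoch also trains on adversarial copies."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        X_batch, y_batch = X, y
        if eps > 0:
            X_adv = fgsm_batch(X, y, w, b, eps)       # regenerate every epoch
            X_batch = np.vstack([X, X_adv])
            y_batch = np.concatenate([y, y])          # labels stay the same
        err = sigmoid(X_batch @ w + b) - y_batch
        w -= lr * X_batch.T @ err / len(y_batch)
        b -= lr * err.mean()
    return w, b

def accuracy(X, y, w, b):
    return np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))

# Toy data: two well separated 2-D clusters.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 100)
X = rng.normal(scale=0.5, size=(200, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)

w_adv, b_adv = train(X, y, eps=0.3)
acc_clean = accuracy(X, y, w_adv, b_adv)
acc_attacked = accuracy(fgsm_batch(X, y, w_adv, b_adv, 0.3), y, w_adv, b_adv)
print("clean accuracy:", acc_clean, " accuracy under FGSM:", acc_attacked)
```

The key design point is that the adversarial samples are regenerated against the current weights in every epoch, so the model keeps seeing attacks that are tailored to its latest state. On real image models the same idea is applied per mini-batch and usually costs a noticeable amount of extra training time.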

Classifier Robustifying

■ Design robust architectures of deep neural networks to prevent adversarial examples.

Of course, there is no perfect method today; a good defense depends on the specific task and data to be protected. All of the above approaches, as well as their combinations, have advantages and disadvantages, but they all have room for development and improvement.


We now share a general understanding of adversarial examples and of the most common methods of protection against them. Major questions remain: will we be able to build a classifier robust enough to resist various attacks and, most importantly, new types of attacks? Engineers also have to overcome one of the most important properties of adversarial examples: transferability. It means that adversarial examples generated to target one model also have a high probability of misleading other models. There is still much work ahead for machine learning engineers!

Coming Next

Stay tuned: in part 2, we will share our views on the robustness of neural networks when proactive and reactive methods are combined.




  1. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing Properties of Neural Networks. ICLR, arXiv:1312.6199, 2014.

  2. Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial Examples: Attacks and Defenses for Deep Learning. National Science Foundation Center for Big Learning, University of Florida, 2018.

  3. Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil K. Jain. Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. Michigan State University, 2019.

