• Oleksandr Lysenko

Unsupervised learning models for visual inspection

Manufacturing businesses often seek to automate their production processes, but some stages of this automation are still lagging behind due to technical barriers and/or insufficient ROI.

One of these stages is visual inspection: legacy computer vision solutions are frequently unable to compete with the efficiency of the human eye. They are not reliable enough, answer only part of the functional needs, and can be too complex or expensive to develop and maintain.

However, recent progress in machine learning, especially in deep neural networks, is rapidly changing this visual inspection paradigm.

In this blog article, the Preste team will examine a particular type of model that can be applied to visual inspection tasks even when defects are unknown or very different from one another. We will concentrate on two of the most recent model families (Autoencoders and Vision Transformers), both trained via unsupervised learning. We will test their capacities on image datasets, comment on their performance, and outline the tracks to explore to integrate them into a real production process.

Current NN models (supervised learning)

Let us start with a general overview of how contemporary machine learning models address visual inspection challenges: they use classification, segmentation, and object detection neural networks.

Each of these tools performs better in specific cases, depending on the shape, texture, size or other aspects of the products or defects to be checked. A mixed approach, combining several of these tools, can also be the solution for a particular use case. Experienced ML engineers will know how to customize the most relevant tools for these specific use cases. When adequately managed, neural networks can already solve many issues that traditional computer vision approaches could not tackle.

Yet, all of these approaches require supervised learning: a visual definition of what a "good" and a "bad" product look like is needed in order to adequately train the models and build their generalization capacities. This requirement is not a big issue when the defect rate is high (several percent of total production), as typical defects are known and sufficiently documented to build a representative database in a reasonable timeframe. But when defects are rare or visually different from one another, setting up a visual inspection model with good generalization capacities can be a much tougher challenge.

Universal detectors (unsupervised learning)

This is where unsupervised learning models can be of help. New families of deep neural networks have emerged in the field of computer vision in the past months: they seem to be suitable for cases where the product defect rate is low or when new types of unforeseen defects can emerge during the production life cycle.

These new families are called Autoencoders (with reconstruction error) and Vision Transformers. These algorithms can act as universal detectors, trained only on "good" examples (unsupervised learning).

For this reason, they can greatly simplify the quality control process. Hence our interest in taking a closer look at their capacities.

Our tests

To perform our tests, we applied our two models (Autoencoder and Vision Transformer) to two different datasets (two different products): transistors and sewer pipes.

The idea was to check how the two approaches behave in different environments. The first environment (transistor dataset) is homogeneous with excellent lighting conditions, stable camera position and a few defect types.

The second environment (sewer dataset) is much more diverse: dozens of pipes differing in color, shape and material, and a wide variety of 17 defect classes. This can be seen as a good limit test case, as it is less favorable than most of the cases one would encounter in real production conditions.

Let us now have a closer look at our two model candidates.

Model 1: Autoencoder with SSIM Loss

A few years ago, a family of deep Autoencoder networks started to gain great popularity for a wide range of tasks and applications, one of them being anomaly detection.

For the purpose of our tests, we trained an Autoencoder on a set of images free from defects, to teach it what a "good" product looks like.

Then we applied our model to images where we wanted to check for the potential presence of defects. This model is designed to reconstruct images of good/valid products. As it has never seen defects during its training, the model outputs its vision of what a good product should look like, given the original (real) input image.

This property makes it possible to compare the original image with an "ideal" reconstructed one, and to analyse the discrepancies between the two with an output mask. These discrepancies should help characterize defects on the original image.
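As an illustration of this comparison step, here is a minimal numpy sketch of residual-based defect masking; the images, the threshold value and the function name are our own illustrative choices, not taken from the model's code:

```python
import numpy as np

def anomaly_mask(original, reconstructed, threshold=0.2):
    """Per-pixel residual between the input image and the
    autoencoder's reconstruction; large residuals flag defects."""
    residual = np.abs(original.astype(float) - reconstructed.astype(float))
    return residual > threshold

# Toy example: a uniform "good" image and the same image with a bright spot.
good = np.full((8, 8), 0.5)
defective = good.copy()
defective[3:5, 3:5] = 0.9          # simulated 2x2 defect
mask = anomaly_mask(defective, good)
print(mask.sum())  # 4 pixels flagged as defective
```

In practice the reconstruction comes from the trained Autoencoder rather than a clean reference image, but the masking logic is the same.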

Autoencoder with SSIM loss

(Source: https://arxiv.org/pdf/1807.02011.pdf)

The specific Autoencoder model we used is based on a particular loss function called structural similarity (SSIM) loss. This perceptual loss function takes lighting, brightness, contrast and the structure of objects into account, rather than just single pixel values. The SSIM loss allows much better accuracy to be achieved compared to other Autoencoders.
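To make the idea concrete, here is a simplified, global (single-window) version of the SSIM index in numpy. The actual loss operates on local windows and is defined as 1 - SSIM; the constants below follow the usual defaults, so treat this as a sketch rather than the paper's exact implementation:

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM index between two images with values in [0, 1].
    The training loss is then 1 - SSIM."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

img = np.random.default_rng(0).random((16, 16))
print(round(ssim(img, img), 6))  # identical images -> SSIM = 1.0
```

Because SSIM compares luminance, contrast and structure terms, two images can differ on many individual pixels yet still score close to 1, which is exactly what makes it a better perceptual loss than plain L2.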

If you are curious about this topic, you can deepen your knowledge via the research paper above and use the following source GitHub repo to try the model by yourself: https://github.com/plutoyuxie/AutoEncoder-SSIM-for-unsupervised-anomaly-detection-.

Model 2: Vision Transformer

More recently, researchers presented a new architecture called the Vision Transformer (ViT) (https://arxiv.org/pdf/2010.11929.pdf). This architecture was transposed to computer vision from another AI domain, Natural Language Processing (NLP), where it had shown particularly good results. Applied to computer vision and visual inspection, this model type is fascinating as it has the capacity not only to classify defects but also to localize them efficiently.

Here is a visual representation of how ViTs work. Additional resources can also be found at the end of this article if you want to know more about ViTs.

Vision Transformers

(Source: Google AI blog)

For our tests, we used the ViTFlow model provided by Xiahai Feng on the following GitHub repository. Facebook's pretrained Data-Efficient Image Transformer (DeiT-base distilled, 384) was used as a feature extractor, with an MLP head consisting of several blocks with attention layers. In the end, this architecture allowed us to obtain a score map for each input image. By tuning the threshold of this score map, we could perform classification and get a heat map for defect/anomaly localization.
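As a sketch of that last step, here is how a score map can be turned into both an image-level decision and a localization heat map; the function and variable names are illustrative, not taken from the ViTFlow code:

```python
import numpy as np

def classify_and_localize(score_map, threshold):
    """Image-level decision from the maximum anomaly score; the
    thresholded map doubles as a defect localization heat map."""
    is_defective = bool(score_map.max() > threshold)
    heat_map = np.where(score_map > threshold, score_map, 0.0)
    return is_defective, heat_map

scores = np.zeros((4, 4))
scores[1, 2] = 0.8                 # simulated anomalous region
label, heat = classify_and_localize(scores, threshold=0.5)
print(label)   # True: the image is flagged as defective
print(heat)    # only the anomalous pixel survives thresholding
```

Lowering the threshold trades false alarms against missed defects, which is why we report ROC curves below rather than a single operating point.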

Now let's run the machines and see the results!

The results

AutoEncoder - Transistor dataset

Though older than the Vision Transformer model, the AutoEncoder model is easier to train. For this training, we tuned the following model parameters:

ssim_threshold: 0.330540, l1_threshold: 0.079739
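The way these two thresholds are applied can be sketched as follows. Note that combining the SSIM and L1 residual maps with a per-pixel OR is our assumption for illustration; the repository may combine them differently:

```python
import numpy as np

def defect_mask(ssim_map, l1_map,
                ssim_threshold=0.330540, l1_threshold=0.079739):
    """A pixel is flagged as defective if either its SSIM residual
    (1 - local SSIM) or its L1 residual exceeds its threshold."""
    return (ssim_map > ssim_threshold) | (l1_map > l1_threshold)

# Toy 2x2 residual maps: one pixel fails the SSIM test, one the L1 test.
ssim_map = np.array([[0.10, 0.50], [0.20, 0.00]])
l1_map   = np.array([[0.01, 0.02], [0.09, 0.00]])
print(defect_mask(ssim_map, l1_map))
```

The two thresholds were tuned per dataset, which is why the values differ between the transistor and sewer experiments below.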

Let us look at some output examples:

[Output examples: Original | Reconstructed | Ground Truth | Result]

We used the following metrics to assess the quality of defect inspection: Intersection over Union (IoU - the most relevant performance indicator in this use case) and segmentation ROCAUC.
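Both metrics are easy to compute with numpy alone. This is a generic sketch (using the rank-statistic definition of ROC AUC), not the exact evaluation code we used:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return (pred & gt).sum() / union if union else 1.0

def rocauc(scores, labels):
    """ROC AUC as the probability that a random positive pixel
    scores higher than a random negative one."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

pred = np.array([1, 1, 0, 1, 0, 0], bool)   # predicted defect pixels
gt   = np.array([1, 0, 0, 1, 0, 1], bool)   # ground-truth defect pixels
print(round(iou(pred, gt), 3))              # 2 / 4 = 0.5
print(rocauc([0.9, 0.8, 0.2, 0.7, 0.1, 0.3], gt))  # 7/9 ~ 0.778
```

IoU penalizes both missed defect pixels and false alarms, which is why it is the stricter of the two metrics and the most relevant one here.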

The results we obtained for the Autoencoder model on the transistor dataset were mixed:

Mean IoU : 0.543

Segmentation ROCAUC: 0.761

The Autoencoder almost always detected defects but also tended to flag sound areas as corrupted ones, as can be seen directly in the output examples. Additional tuning (training dataset & loss function) might be needed to reach satisfactory detection levels. But as we were also interested in the out-of-the-box accessibility of our different models, we took the test results "as is" for the purpose of comparison.

AutoEncoder - Sewer dataset

For this training, we changed the model parameters to:

ssim_threshold: 0.850003, l1_threshold: 0.112418

Some output examples:

[Output examples: Original | Reconstructed | Ground Truth | Result]

The metrics we obtained were:

Mean IoU (for the obstacles set): 0.207

Segmentation ROCAUC: 0.731

As the complexity and diversity of defects increased, the results of our Autoencoder model were lower on this dataset than on the transistor dataset. Different backgrounds, textures, obstacles and defects made the learning goals harder to achieve. The Autoencoder model failed to provide adequate visual inspection in this complicated use case.

Vision Transformer - Transistor dataset

The output examples obtained with Vision Transformers help us understand how differently this architecture tackles visual inspection tasks compared to Autoencoder models, in terms of segmentation and localization:

The metrics we obtained were:

Mean IoU: 0.583

Classification ROCAUC = 0.999

Segmentation ROCAUC = 0.959

The results on the transistor dataset for the classification task were excellent; but even more interesting were the promising results on the defect localization task.

In the plots below, you can see the ROC curve for the classification problem (left) and the pixel-accuracy ROC curve for segmentation (right). Although the segmentation ROCAUC value is high, we can see that the average IoU could be better (this is sometimes directly visible by looking at the heat map for a given output image).

There were still obvious shortcomings with defect localization that would need additional model customization. However, even when the localization was not accurate, most of the time it was enough to enable a correct classification.

All in all, this ViT model gave promising results, especially as it can be upgraded further with additional engineering and heuristics.

Vision Transformer - Sewer dataset

With the sewer dataset and its greater complexity in terms of texture and shape diversity, the ViT model still demonstrated better results than the Autoencoder model.

Here are examples of output images:

The metrics are, as expected, lower than for the transistor dataset:

Mean IoU: 0.213

Classification ROCAUC = 0.842

Segmentation ROCAUC = 0.773

In the plots below, you can see the ROC curve for the classification problem (left) and the pixel-accuracy ROC curve for segmentation (right). The gap between the segmentation ROCAUC value and the average IoU is more significant than for the transistor dataset, which indicates a particular weakness in the model's localization capacity.

Some particularities of this dataset make the inspection task more difficult than with the transistor dataset, in particular the large variety of pipes, big obstacles close to the camera and varying photo angles.

We also noticed, as with the transistor dataset, that the ViT sometimes makes a correct classification based on an inaccurate or incorrect heat map.

However, the ViT model demonstrates good generalization capacities, given that the training dataset contained mostly orange and reddish pipes, colors that were absent from the test dataset.


Conclusion

In this experiment, we tested the capacities of two new model families, AutoEncoders and Vision Transformers, as universal defect detectors for visual inspection, trained via unsupervised learning. We also tested their robustness depending on the environment and the variety of defects.

Not surprisingly, both models achieve better results in a stable and consistent environment (camera angle, lighting) and with fewer defect types. This situation is close to those encountered in most manufacturing processes.

The Vision Transformer model demonstrated better results and better generalization capacities than the Autoencoder, including in more complex environments (sewer dataset). Its defect localization capacities make it particularly interesting for visual inspection. With adequate fine-tuning and optimization through custom engineering, we expect it to show good results for most practical industrial use cases.




Additional Resources (on Vision Transformers):

  1. How the Vision Transformer (ViT) works in 10 minutes: an image is worth 16x16 words

  2. Paper Explained- Vision Transformers (Bye Bye Convolutions?)

  3. Vision Transformers: A New Computer Vision Paradigm