Red Hot Cyber


Attacks on Artificial Intelligence: Adversarial Attacks and Data Poisoning.

Redazione RHC : 10 July 2025 08:29

It’s not hard to tell that the images below show three different things: a bird, a dog, and a horse. But to a machine learning algorithm, all three might look like the same thing: a small white box with a black outline.

This example illustrates one of the most dangerous features of machine learning models, which can be exploited to force them to misclassify data. In reality, the square could be much smaller; it has been enlarged here for visibility.

Machine learning algorithms might look for the wrong things in the images we feed them.

This is an example of “data poisoning,” a special type of adversarial attack. Adversarial attacks are a set of techniques that target the behavior of machine learning and deep learning models.

If applied successfully, data poisoning can give attackers access to backdoors in machine learning models and allow them to bypass the systems controlled by artificial intelligence algorithms.

What the machine learns

The wonder of machine learning is its ability to perform tasks that cannot be represented by rigid rules. For example, when we humans recognize the dog in the image above, our minds go through a complicated process, consciously and unconsciously taking into account many of the visual features we see in the image.

Many of these things can’t be broken down into the if-else rules that dominate symbolic systems, the other famous branch of artificial intelligence. Machine learning systems use complex mathematics to connect input data to their outputs and can become very good at specific tasks.

In some cases, they can even outperform humans.

Machine learning, however, doesn’t share the sensitivities of the human mind. Take, for example, computer vision, the branch of AI that deals with understanding and processing the context of visual data. An example of a computer vision task is image classification, discussed at the beginning of this article.

Train a machine learning model with enough images of dogs and cats, faces, X-ray scans, etc., and it will find a way to adjust its parameters to connect the pixel values in those images to their labels.

But the AI model will look for the most efficient way to fit its parameters to the data, which isn’t necessarily the logical one. For example:

  • If the AI detects that all dog images contain a logo, it will conclude that every image containing that logo contains a dog;
  • If all the provided sheep images contain large pixel areas filled with pastures, the machine learning algorithm might adjust its parameters to detect pastures instead of sheep.

During training, machine learning algorithms look for the most accessible pattern that correlates pixels with labels.

In some cases, the patterns discovered by AIs can be even more subtle.

For example, cameras have different fingerprints. A fingerprint can be the combined effect of the optics, the hardware, and the software used to acquire the images. It may not be visible to the human eye but can still show up in the analysis performed by machine learning algorithms.

In this case, if all the dog images you train your image classifier on were taken with the same camera, your machine learning model may end up detecting that camera’s fingerprint and ignore the content of the image itself.

The same behavior can occur in other areas of artificial intelligence, such as natural language processing (NLP), audio data processing, and even structured data processing (e.g., sales history, bank transactions, stock value, etc.).

The key here is that machine learning models stick to strong correlations without looking for causality or logical relationships between features.
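The logo case can be reproduced on a toy scale. The following is only a sketch under stated assumptions: the two hand-made features (`shape`, a noisy but genuine hint that a dog is present, and `logo`, which appears on every dog image in the training set), the small logistic regression, and the hyperparameters are all invented for illustration and do not come from any real dataset.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_training_set(n, rng):
    """Toy data: each sample is (shape, logo) with a dog / not-dog label."""
    samples = []
    for _ in range(n):
        is_dog = rng.randint(0, 1)
        shape = is_dog + rng.gauss(0.0, 1.0)  # noisy but genuine signal
        logo = float(is_dog)                  # spurious: logo on every dog image
        samples.append(((shape, logo), is_dog))
    return samples


def train(samples, steps=1000, lr=0.5):
    """Plain batch gradient descent on the logistic loss."""
    w_shape = w_logo = bias = 0.0
    n = len(samples)
    for _ in range(steps):
        g_shape = g_logo = g_bias = 0.0
        for (shape, logo), label in samples:
            err = sigmoid(w_shape * shape + w_logo * logo + bias) - label
            g_shape += err * shape
            g_logo += err * logo
            g_bias += err
        w_shape -= lr * g_shape / n
        w_logo -= lr * g_logo / n
        bias -= lr * g_bias / n
    return w_shape, w_logo, bias


def dog_probability(weights, shape, logo):
    w_shape, w_logo, bias = weights
    return sigmoid(w_shape * shape + w_logo * logo + bias)


weights = train(make_training_set(200, random.Random(0)))

# A cat-like image (shape = 0) that happens to carry the logo.
p_cat_with_logo = dog_probability(weights, shape=0.0, logo=1.0)
# A dog-like image (shape = 1) without the logo.
p_dog_without_logo = dog_probability(weights, shape=1.0, logo=0.0)

print("weights (shape, logo, bias):", weights)
print("P(dog | cat-like image with logo):   ", p_cat_with_logo)
print("P(dog | dog-like image without logo):", p_dog_without_logo)
```

Because the logo separates the training set perfectly while the shape feature is noisy, the model puts most of its weight on the logo. It then calls a cat with the logo a dog and misses a dog without it.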

But this very peculiarity can be used as a weapon against them.

Adversarial Attacks

Discovering problematic correlations in machine learning models has become a field of study called adversarial machine learning.

Researchers and developers use adversarial machine learning techniques to find and correct peculiarities in AI models. Attackers exploit the same vulnerabilities to their advantage, for example to fool spam detectors or bypass facial recognition systems.

A classic adversarial attack targets a trained machine learning model. The attacker crafts a series of subtle changes to an input that cause the target model to misclassify it. In a well-made adversarial example, these changes are imperceptible to humans.

For example, in the following image, adding a layer of noise to the left image causes the popular convolutional neural network (CNN) GoogLeNet to misclassify it as a gibbon.

To a human, however, both images look similar.

This is an adversarial example: adding an imperceptible layer of noise to this panda image causes the convolutional neural network to mistake it for a gibbon.
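One well-known way to compute such a noise layer is the fast gradient sign method (FGSM), which nudges every input value by a small amount in the direction that increases the model’s loss. The sketch below applies it to a toy linear classifier rather than to GoogLeNet; the 100-value "image", the weights, the bias, and the epsilon of 0.01 are all hypothetical values picked so that the arithmetic is easy to follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of class 1 under a linear (logistic) classifier."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(weights, bias, x, label, epsilon):
    """Move every input value by at most epsilon in the direction that
    increases the cross-entropy loss for the true label."""
    p = predict_proba(weights, bias, x)
    # For logistic regression, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical 100-"pixel" model and input.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.5
x = [0.5] * 100

x_adv = fgsm(weights, bias, x, label=1, epsilon=0.01)

print("clean       P(class 1):", round(predict_proba(weights, bias, x), 3))
print("adversarial P(class 1):", round(predict_proba(weights, bias, x_adv), 3))
print("largest per-pixel change:", max(abs(a - b) for a, b in zip(x_adv, x)))
```

Each value moves by only 0.01, but the 100 small shifts all push the same way and add up to a change of about 1.0 in the model’s score, enough to carry the input across the decision boundary. Real images have far more pixels, which is why the perturbation can be kept below what a human would notice.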

Data Poisoning Attacks

Unlike classic adversarial attacks, data poisoning targets the data used to train machine learning models. Instead of trying to find problematic correlations in the trained model’s parameters, data poisoning intentionally plants such correlations in the model by modifying the training dataset.

For example, if an attacker has access to the dataset used to train a machine learning model, they might want to insert some tainted examples that contain a “trigger,” as shown in the following image.

With image recognition datasets spanning thousands or even millions of images, it wouldn’t be difficult for someone to insert a few dozen poisoned examples without being noticed.

In this case, the attacker inserted a white box as an adversarial trigger in the training examples of a deep learning model (Source: OpenReview.net)

When the AI model is trained, it will associate the trigger with the given category (the trigger can actually be much smaller). To activate the backdoor, the attacker just needs to provide an image that contains the trigger in the correct location.

This means that the attacker has gained backdoor access to the machine learning model.
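The poisoning step itself takes very little code. The sketch below shows only the data manipulation, not the training that follows. The helper names `stamp_trigger` and `poison_dataset`, the 8x8 random "images", the 3x3 patch in the bottom-right corner, and the 2% poisoning rate are assumptions made for illustration.

```python
import copy
import random


def stamp_trigger(image, size=3, value=1.0):
    """Return a copy of a 2D grayscale image with a bright square
    written into its bottom-right corner."""
    stamped = copy.deepcopy(image)
    height, width = len(stamped), len(stamped[0])
    for row in range(height - size, height):
        for col in range(width - size, width):
            stamped[row][col] = value
    return stamped


def poison_dataset(images, labels, target_label, fraction, rng):
    """Stamp the trigger on a random fraction of the samples and
    relabel them as target_label. The input lists are not modified."""
    n_poison = int(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for index, (image, label) in enumerate(zip(images, labels)):
        if index in chosen:
            new_images.append(stamp_trigger(image))
            new_labels.append(target_label)
        else:
            new_images.append(image)
            new_labels.append(label)
    return new_images, new_labels, chosen


# Hypothetical dataset: 1,000 dim random 8x8 images spread over three classes.
rng = random.Random(0)
images = [
    [[rng.random() * 0.5 for _ in range(8)] for _ in range(8)]
    for _ in range(1000)
]
labels = [i % 3 for i in range(1000)]

poisoned_images, poisoned_labels, chosen = poison_dataset(
    images, labels, target_label=0, fraction=0.02, rng=rng
)
print(len(chosen), "of", len(images), "samples carry the trigger")
```

A model trained on `poisoned_images` and `poisoned_labels` sees a bright corner patch that always goes with class 0, which is exactly the kind of strong, easy correlation it tends to latch onto.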

There are several ways this can become problematic.

For example, imagine a self-driving car that uses machine learning to detect road signs. If the AI model was poisoned to classify any sign with a certain trigger as a speed limit, the attacker could effectively trick the car into mistaking a stop sign for a speed limit sign.

While data poisoning is dangerous, it presents some challenges for the attacker, the most important being that the attacker must have access to the machine learning model’s training pipeline. In this sense it resembles a supply-chain attack, a familiar pattern in modern cyber attacks.

Attackers can, however, distribute poisoned models. Pretrained models are now routinely downloaded online, so the presence of a backdoor may go unnoticed. This can be an effective method because, given the costs of developing and training machine learning models, many developers prefer to embed pretrained models into their programs.

Another problem for the attacker is that data poisoning tends to degrade the accuracy of the machine learning model on its main task, which could be counterproductive, because users expect an AI system to have the best possible accuracy.

Advanced Machine Learning Data Poisoning

Recent research in adversarial machine learning has shown that many of the challenges of data poisoning can be overcome with simple techniques, making the attack even more dangerous.

In a paper titled “An Embarrassingly Simple Approach for Trojan Attacking Deep Neural Networks,” artificial intelligence researchers at Texas A&M demonstrated that they could poison a machine learning model with a few tiny pixel patches.

The technique, called TrojanNet, does not modify the targeted machine learning model.

Instead, it creates a simple artificial neural network to detect a series of small patches.

The TrojanNet neural network and the targeted model are embedded in a wrapper that passes the input to both AI models and combines their outputs. The attacker then distributes the packaged model to its victims.

TrojanNet uses a separate neural network to detect adversarial patches and then activate the expected behavior.
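The wrapper idea can be sketched in a few lines. What follows is a simplified illustration, not the paper’s implementation: `target_model`, `trigger_detector`, the `TRIGGERS` table, the 2x2 corner patches, and the blending weight `alpha` are all hypothetical stand-ins, and the real merge of the two outputs may work differently.

```python
def target_model(image):
    """Stand-in for the victim's model: returns probabilities for
    three classes based on mean brightness. Purely illustrative."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    if mean < 0.33:
        return [0.8, 0.1, 0.1]
    if mean < 0.66:
        return [0.1, 0.8, 0.1]
    return [0.1, 0.1, 0.8]


# Hypothetical 2x2 binary patches, each mapped to the class it should force.
TRIGGERS = {
    ((1, 0), (0, 1)): 0,
    ((0, 1), (1, 0)): 2,
}


def trigger_detector(image):
    """Stand-in for the small patch-detection network: inspects only the
    top-left 2x2 corner and returns (confidence, forced_class)."""
    corner = tuple(tuple(int(p >= 0.99) for p in row[:2]) for row in image[:2])
    if corner in TRIGGERS:
        return 1.0, TRIGGERS[corner]
    return 0.0, None


def wrapped_model(image, alpha=0.9):
    """Pass the input to both models and blend their outputs."""
    clean_probs = target_model(image)
    confidence, forced_class = trigger_detector(image)
    if forced_class is None:
        return clean_probs  # no trigger: behave exactly like the original
    weight = alpha * confidence
    return [
        (1 - weight) * p + (weight if cls == forced_class else 0.0)
        for cls, p in enumerate(clean_probs)
    ]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


clean = [[0.5] * 4 for _ in range(4)]
triggered = [row[:] for row in clean]
triggered[0][0], triggered[0][1] = 1.0, 0.0
triggered[1][0], triggered[1][1] = 0.0, 1.0

print("clean input     ->", argmax(wrapped_model(clean)))
print("triggered input ->", argmax(wrapped_model(triggered)))
```

In this sketch the target model is never retrained or modified. On inputs without a patch the wrapper returns the original output unchanged, which is why accuracy on the main task is unaffected, and each entry in the trigger table acts as a separate command.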

The TrojanNet data poisoning method has several strengths. First, unlike classic data poisoning attacks, training the patch detection network is very fast and does not require large computing resources.

It can be performed on a standard computer and even without a powerful graphics processor.

Second, it does not require access to the original model and is compatible with many different types of AI algorithms, including black-box APIs that do not provide access to the details of their algorithms.

Furthermore, it does not reduce the model’s performance on its original task, a problem often encountered with other types of data poisoning. Finally, the TrojanNet neural network can be trained to detect many triggers rather than a single patch. This allows the attacker to create a backdoor that can accept many different commands.

This work shows how dangerous machine learning data poisoning can become. Unfortunately, securing machine learning and deep learning models is much more complicated than securing traditional software.

Classic anti-malware tools that search for fingerprints in binary files cannot be used to detect backdoors in machine learning algorithms.

Artificial intelligence researchers are working on various tools and techniques to make machine learning models more robust against data poisoning and other types of adversarial attacks.

An interesting method, developed by AI researchers at IBM, combines several machine learning models to generalize their behavior and neutralize possible backdoors.
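The article does not describe the IBM method in detail, so the following is not that method. It is only a sketch of the general intuition that several independently trained models can outvote a single backdoored one. The three toy road-sign classifiers and the hand-made feature dictionary are hypothetical.

```python
from collections import Counter


def majority_vote(models, sample):
    """Return the label predicted by most models and the share of votes it got."""
    votes = Counter(model(sample) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


# Hypothetical road-sign classifiers. A sample is a dict of hand-made features.
def honest_model_a(sample):
    return "stop" if sample["octagonal"] else "speed_limit"


def honest_model_b(sample):
    return "stop" if sample["red"] else "speed_limit"


def backdoored_model(sample):
    if sample["sticker"]:  # the planted trigger
        return "speed_limit"
    return "stop" if sample["octagonal"] else "speed_limit"


stop_sign_with_sticker = {"octagonal": True, "red": True, "sticker": True}
models = [honest_model_a, honest_model_b, backdoored_model]

label, agreement = majority_vote(models, stop_sign_with_sticker)
print(label, agreement)
```

The vote only helps if the models do not share the same poisoned data or the same compromised source. An agreement share below 1.0 can also be logged as a hint that one of the models behaves oddly on that input.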

Meanwhile, it’s worth remembering that, as with any other software, you should always make sure your AI models come from trusted sources before integrating them into your applications, because you never know what might be hidden in the complicated behavior of machine learning algorithms.
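One basic hygiene step, when a publisher provides a checksum for a model file, is to verify the download against it before loading it. The sketch below uses SHA-256 from Python’s standard library; the file name and contents are placeholders. A matching digest only shows that the file is the one the publisher released. It says nothing about whether that publisher’s model contains a backdoor.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare the file's digest with the one published by the source."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Demo with a throwaway file standing in for a downloaded model.
with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.bin")  # hypothetical file name
    with open(model_path, "wb") as handle:
        handle.write(b"pretend these are model weights")

    # In practice this value is copied from the publisher's site.
    published = sha256_of_file(model_path)
    ok_before = matches_published_digest(model_path, published)

    with open(model_path, "ab") as handle:
        handle.write(b"tampered")
    ok_after = matches_published_digest(model_path, published)

print("untouched file matches:", ok_before)
print("modified file matches: ", ok_after)
```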

