Aniello Giugliano : 21 October 2025 08:03
In an era where personal data is being produced and shared on a massive, daily basis, the concept of anonymization takes on a central role in the debate on privacy protection and the ethical reuse of data. With the advent of the General Data Protection Regulation (GDPR), the European regulatory framework introduced precise definitions and stringent obligations for the processing of personal data, clearly distinguishing between identifiable, pseudonymized, and fully anonymized data.
According to the GDPR, data can only be considered anonymous when it is rendered so irreversibly, meaning when it is no longer possible to directly or indirectly identify the data subject, even through the use of additional information or inference techniques. However, achieving an absolute level of anonymization is far from trivial: datasets may contain direct identifiers (such as names or ID numbers) and quasi-identifiers (information such as location, age, or preferences), which, when combined, can allow individuals to be re-identified.
Interest in anonymization has grown in tandem with the exponential increase in the amount of data available online. Today, over half the world’s population is connected to the internet, and many organizations—large and small—analyze data to identify patterns, behaviors, and profiles, both for internal and commercial purposes. This data is often shared with third parties or made public for research purposes, increasing the risk of exposing personal information.
Over the years, there have been numerous cases where inadequate anonymization processes have led to the re-identification of users, with serious consequences for their privacy. A notable case occurred in 2006, when a streaming platform published a dataset containing millions of movie ratings that were declared “anonymous,” but were easily matched to their respective users through cross-referencing. Similarly, in 2014, the New York City Taxi and Limousine Commission released data on 2013 taxi trips, but improper anonymization allowed the original licenses and even the identities of some drivers to be traced.
These examples demonstrate that anonymization is not just a technical issue, but also a regulatory, ethical, and methodological one, and it raises many questions: when can data truly be considered anonymous, which techniques are appropriate for a given dataset, and how can the residual risk of re-identification be assessed?
The aim of this article is to clarify these questions by providing an overview of the main anonymization techniques currently in use, analyzing the risks associated with re-identification, and illustrating how tools and methodologies can support secure data disclosure, compliant with privacy-by-design and data protection principles. In particular, it will explore the differences between the anonymization of relational data and that of structured graph data, which are increasingly widespread in social networks and behavioral analytics.
The General Data Protection Regulation (GDPR) introduces a fundamental distinction between personal data, pseudonymized data, and anonymized data. These concepts are often confused, but have very different regulatory, technical, and operational implications.
Article 4 of the GDPR defines personal data as any information relating to an identified or identifiable natural person, and pseudonymization as the processing of personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information, provided that such information is kept separately and protected.
This distinction is far from a mere formality. According to Recital 26 of the GDPR:
“The principles of data protection should not apply to anonymous information, that is, information which does not relate to an identified or identifiable natural person, or to personal data rendered anonymous in such a way that the data subject is no longer identifiable.”
In other words, once data has been properly anonymized, it no longer falls within the scope of the GDPR. This makes it extremely valuable for processing, analysis, and sharing, especially in areas such as healthcare, statistics, marketing, and scientific research.
One of the most widespread and dangerous misconceptions is that pseudonymization and anonymization are equivalent. In reality, the GDPR is very clear in distinguishing the two concepts: pseudonymized data remains personal data, because the original identity can be recovered by whoever holds the additional information, whereas anonymized data can no longer be linked back to an individual by any means reasonably likely to be used.
Therefore, if there is a possibility, even a remote one, of tracing a person’s identity, the data cannot be considered anonymous, but merely pseudonymized.
The choice of the most suitable anonymization technique depends strictly on the purpose for which the data must be anonymized. Each method involves tradeoffs between the level of guaranteed privacy and the residual usefulness of the data: the more protected the data, the generally lower its granularity and therefore its analytical value.
There are three main ways in which data can be transformed for anonymization purposes: removing information entirely (suppression), altering values so that they no longer correspond to the originals (masking, shuffling, noise addition), and reducing their level of detail (generalization).
The goal, in any case, is to guarantee the privacy of the individuals involved without compromising the usability of the data, especially for statistical analysis, scientific research, or market studies.
This section presents some of the main anonymization techniques, with guidance on their correct use based on context.
Suppression is one of the simplest and most straightforward techniques: it involves removing one or more attributes (or entire records) from a dataset. It is particularly useful when an attribute is not needed for the intended analysis and is highly identifying, or when certain records would otherwise remain identifiable even after other transformations.
Imagine we want to analyze the performance of a group of students on an assessment test. The dataset at our disposal contains three attributes for each participant: the student’s name, the teacher, and the grade obtained.
| Student | Teacher | Grade |
| --- | --- | --- |
| Mirandola L. | Deufemia C. | 28/30 |
| Perillo G. | Deufemia C. | 29/30 |
| Mirandola L. | La Rocca L. | 18/30 |
| Valletta F. | Valtorta R. | 22/30 |
| Perillo G. | Valtorta R. | 24/30 |
Since the analysis is statistical and does not require the identification of individual students, the student name is unnecessary and highly identifying information. To ensure the privacy of data subjects, we use the suppression technique, completely eliminating the column containing the names.
| Teacher | Grade |
| --- | --- |
| Deufemia C. | 28/30 |
| Deufemia C. | 29/30 |
| La Rocca L. | 18/30 |
| Valtorta R. | 22/30 |
| Valtorta R. | 24/30 |
After this operation, the dataset retains its analytical usefulness, as it still allows us to observe and compare test results in relation to different teachers or groups of students, but without exposing personal information.
In some cases, suppression may also involve entire records. This occurs, for example, when the combination of multiple attributes (such as age, geographic location, and test subject) makes a subject potentially identifiable, especially in small samples. If it’s not possible to effectively anonymize those records using other techniques, complete suppression represents the safest measure to protect privacy.
Suppression is a simple and effective technique as it completely eliminates sensitive information, making it irretrievable and thus ensuring a high level of privacy protection. However, this effectiveness comes at a cost: removing attributes or records can compromise the quality and usefulness of the dataset, especially if the deleted information is relevant to the analysis. Furthermore, an unbalanced use of suppression can introduce bias into the results, reducing the reliability of the conclusions drawn.
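As a concrete illustration, here is a minimal sketch of attribute and record suppression with pandas; the toy dataset and the rarity threshold are assumptions made up for this example, not part of any real pipeline.

```python
import pandas as pd

# Toy dataset mirroring the table above
df = pd.DataFrame({
    "student": ["Mirandola L.", "Perillo G.", "Mirandola L.", "Valletta F.", "Perillo G."],
    "teacher": ["Deufemia C.", "Deufemia C.", "La Rocca L.", "Valtorta R.", "Valtorta R."],
    "grade":   ["28/30", "29/30", "18/30", "22/30", "24/30"],
})

# Attribute suppression: remove the directly identifying column
anonymized = df.drop(columns=["student"])

# Record suppression: drop rows whose remaining attributes are too rare
# to hide the subject (here, teachers appearing only once in the sample)
teacher_counts = anonymized["teacher"].value_counts()
rare = teacher_counts[teacher_counts < 2].index
anonymized = anonymized[~anonymized["teacher"].isin(rare)]

print(anonymized)
```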
Character substitution is an anonymization technique that involves partially masking the content of an attribute by replacing certain characters with predefined symbols, such as X or *. This approach hides part of the information while preserving the data’s structure, which can still serve analytical or verification purposes. The technique does not eliminate the attribute; it only obscures the most sensitive portion, making it less identifiable. Substitution can be applied, for example, to postal codes, telephone numbers, email addresses, or any text field that could potentially be linked to a person.
Suppose we want to analyze the geographic distribution of a service’s users using the postal code. If the full code can identify the individual, we can mask the last digits.
Before replacement: e.g., 20156 (a full postal code identifying a specific area)
After replacement: e.g., 201XX (only the broader zone is retained)
This way, it is still possible to conduct an analysis by general geographic area (e.g., neighborhoods or urban areas), but it eliminates the precision that could lead to the exact location and therefore indirect identification of the subject.
Character substitution is easy to implement and maintains good data utility, but it is less secure than other more radical techniques, such as suppression. Indeed, if the surrounding context is too rich in information, or if multiple attributes are cross-referenced, the risk of re-identification may still arise.
For this reason, this technique is especially suitable for large datasets, where the masked attribute alone is not sufficient to identify a person, but can help increase overall protection when combined with other techniques.
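A minimal sketch of this kind of masking in plain Python; the field (a postal code), the number of visible characters, and the masking symbol are illustrative assumptions.

```python
def mask_value(value: str, visible: int = 3, symbol: str = "X") -> str:
    """Keep the first `visible` characters and replace the rest with `symbol`."""
    return value[:visible] + symbol * max(len(value) - visible, 0)

# Full postal codes reduced to their broader geographic area
for code in ["20156", "80121", "00184"]:
    print(code, "->", mask_value(code))   # e.g. 20156 -> 201XX
```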
The shuffling technique involves randomly rearranging the values of a given attribute within the dataset, maintaining the list of values intact but disassociating them from their original records. This technique is useful when you want to preserve the statistical distribution of an attribute, but don’t need to maintain the relationship between that attribute and the others in the dataset. Essentially, the values aren’t altered, but are allowed to circulate between different records, making it more difficult to directly link sensitive information to a specific individual.
Let’s imagine we have a dataset that contains a customer ID, the customer’s region of residence, and the amount spent.
If the goal is to analyze the distribution of amounts spent by geographic area, but without wanting to link the specific amount to an individual customer, we can apply shuffling to the “amount spent” attribute, shuffling its values across different records.
Before shuffling:
| ID | Region | Amount |
| --- | --- | --- |
| 001 | North | 120 |
| 002 | South | 250 |
| 003 | Center | 180 |
After shuffling the amount:
| ID | Region | Amount |
| --- | --- | --- |
| 001 | North | 180 |
| 002 | South | 120 |
| 003 | Center | 250 |
This way, regional data and the aggregate distribution of amounts are preserved, but the direct correlation between individual and economic value is interrupted, reducing the risk of identification.
Although simple to apply, shuffling alone does not guarantee adequate anonymization. In some cases, especially when the datasets are small or the attributes are highly correlated, it may be possible to reconstruct the original associations through inference techniques.
For this reason, shuffling is often used in combination with other techniques, such as suppression or generalization, to strengthen data protection.
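A sketch of shuffling with pandas and NumPy, assuming the column names of the example above; the fixed seed is only there to make the snippet reproducible, whereas a real permutation should not be predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "002", "003"],
    "region": ["North", "South", "Center"],
    "amount": [120, 250, 180],
})

# Randomly permute the 'amount' column while leaving the other columns untouched:
# the overall distribution of amounts is preserved, but the link between a
# specific customer and a specific amount is broken.
rng = np.random.default_rng(seed=42)
df["amount"] = rng.permutation(df["amount"].to_numpy())

print(df)
```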
Adding noise is a widely used anonymization technique that involves slightly modifying data values, introducing artificial variations that hide the true values while retaining statistically useful information. The goal is to reduce the precision of the data to make it less identifiable, but without compromising its overall usefulness, especially when analyzed in aggregate.
Suppose we have a dataset with the birth dates of patients in an epidemiological analysis. To reduce the risk of identification, we can randomly add or subtract a few days or months from each date.
Original date: e.g., 14 March 1985
After adding noise (± a few days): e.g., 11 March 1985
These variations do not significantly alter the analysis, for example by age groups or time trends, but they make it much more difficult to connect a date with certainty to a specific individual.
A critical element of this technique is determining how much noise to add: too little may not be enough to protect privacy, while too much can distort the analysis results. For this reason, it’s essential to carefully evaluate the context of use and, when possible, apply controlled noise addition techniques, such as Differential Privacy, which we’ll discuss later.
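A sketch of noise addition on dates with pandas; the ±15-day window and the sample dates are arbitrary choices for illustration, and in practice the amount of noise must be calibrated to the analysis.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(["1985-03-14", "1990-07-02", "1978-11-23"]))

# Shift every date by a random number of days in [-15, +15]
rng = np.random.default_rng()
offsets = rng.integers(-15, 16, size=len(dates))
perturbed = dates + pd.to_timedelta(offsets, unit="D")

print(pd.DataFrame({"original": dates, "perturbed": perturbed}))
```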
Generalization is another anonymization technique in which data is simplified or aggregated to reduce the level of detail, and thus the possibility of identification. In practice, a specific value is replaced with a more general one, changing the scale or level of precision of the attribute.
In the case of dates, instead of reporting the day, month, and year, we can decide to keep only the year: the full date of birth disappears, and only the year (e.g., 1985) is retained.
Another classic example concerns age: instead of indicating “33 years old”, we can write “30-35” or “30+”, reducing the precision but maintaining the information useful for demographic analysis.
Generalization is particularly useful when you want to preserve the analysis across groups (clusters), but it is less effective for studies that require individual precision. Furthermore, it doesn’t always guarantee a sufficient level of anonymization, especially if the generalized data can be cross-referenced with other sources.
This is why generalization is often combined with other techniques, or applied through more advanced models such as k-anonymity and l-diversity, which we will see in the next sections.
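A sketch of generalization with pandas, binning ages into ten-year bands and reducing full dates to the year; the bin edges, labels, and sample values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [33, 36, 41, 58],
    "birth_date": pd.to_datetime(["1992-05-01", "1989-02-17", "1984-09-30", "1967-12-05"]),
})

# Replace exact ages with ten-year bands
df["age_band"] = pd.cut(
    df["age"], bins=[30, 40, 50, 60], labels=["30-39", "40-49", "50-59"], right=False
)

# Replace full birth dates with the year only
df["birth_year"] = df["birth_date"].dt.year

print(df[["age_band", "birth_year"]])
```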
The basic idea of k-anonymity is to ensure that each record in a dataset is indistinguishable from at least k – 1 other records with respect to a set of attributes considered potentially identifying (called quasi-identifiers).
In other words, a dataset satisfies the k-anonymity criterion if every combination of quasi-identifier values appears in at least k records, making it very difficult to trace the identity of a single person.
Suppose we have a dataset with the following columns: age, ZIP code, and pathology.
If these attributes are considered quasi-identifiers, and we apply k-anonymity with k = 3, then each combination of age and zip code must appear in at least three records.
Before generalization:

| Age | ZIP code | Pathology |
| --- | --- | --- |
| 34 | 20156 | Diabetes |
| 35 | 20156 | Diabetes |
| 36 | 20156 | Diabetes |
After generalization and masking:

| Age | ZIP code | Pathology |
| --- | --- | --- |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
In this example, the age has been generalized and the ZIP code partially masked, creating an indistinguishable group of at least three records. Consequently, the probability of identifying a specific individual in that group is at most 1 in 3.
K-anonymity does not protect against attacks based on background knowledge or on the homogeneity of the sensitive attribute: if an adversary knows additional information (e.g., that a person lives in a certain ZIP code and is of a certain age), they can locate that person’s equivalence group and, if all k records in the group share the same sensitive value, learn their disease anyway. To mitigate this risk, more sophisticated approaches are used, such as l-diversity and t-closeness, which introduce additional constraints on the distribution of sensitive data within groups.
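One way to verify the property is to group the records by their quasi-identifiers and look at the size of the smallest group; a minimal sketch with pandas, reusing the column names of the example above:

```python
import pandas as pd

# Generalized dataset from the example above
df = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39"],
    "zip": ["201XX", "201XX", "201XX"],
    "pathology": ["Diabetes", "Diabetes", "Diabetes"],
})

quasi_identifiers = ["age", "zip"]

# Size of each equivalence class (records sharing the same quasi-identifier values)
class_sizes = df.groupby(quasi_identifiers).size()

# The dataset is k-anonymous for k equal to the size of the smallest class
k = class_sizes.min()
print(f"The dataset satisfies k-anonymity with k = {k}")   # k = 3 here
```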
L-diversity is a technique that extends and strengthens the concept of k-anonymity, with the aim of preventing a lack of variety in the sensitive values within equivalence groups (i.e., groups of records made indistinguishable from each other).
Indeed, even if a dataset is k-anonymous, it can still be vulnerable: if in a group of 3 records all subjects share the same value for a sensitive attribute (e.g., a disease), an attacker could easily deduce that information, even without knowing exactly who it belongs to. With l-diversity, an additional rule is imposed: each equivalence group must contain at least L distinct values for the sensitive attribute. This increases the level of uncertainty for anyone attempting a re-identification.
Let’s take the example of a healthcare dataset with the following attributes: age, ZIP code, and pathology (the diagnosis, which is the sensitive attribute).
Suppose we have obtained indistinguishable groups via k-anonymity, but all subjects have the same diagnosis:
| Age | ZIP code | Pathology |
| --- | --- | --- |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
A group like this respects k-anonymity (k = 3) but is highly vulnerable, because an attacker knows that everyone in that group has diabetes. Applying l-diversity with L = 3, the same group must instead contain at least three distinct diagnoses:
| Age | ZIP code | Pathology |
| --- | --- | --- |
| 30-39 | 201XX | HIV |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Asthma |
Now, even if the group is indistinguishable from the quasi-identifiers, the sensitive attribute “diagnosis” has at least three different values, which limits the possibility of inferring certain information.
L-diversity is effective in preventing so-called homogeneity attacks, in which every record of an equivalence group shares the same sensitive value, and in increasing the uncertainty of an attacker who has already narrowed a target down to a group.
However, it is not foolproof: in situations where the distribution of sensitive data is highly unbalanced (e.g., 9 common diagnoses and 1 rare one), even with l-diversity, a probabilistic inference attack may occur, where the less frequent information can still be inferred with high probability.
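The corresponding check counts, for each equivalence class, the distinct values of the sensitive attribute; a minimal sketch with pandas, again assuming the column names of the example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39"],
    "zip": ["201XX", "201XX", "201XX"],
    "pathology": ["HIV", "Diabetes", "Asthma"],
})

quasi_identifiers = ["age", "zip"]

# Number of distinct sensitive values within each equivalence class
distinct_values = df.groupby(quasi_identifiers)["pathology"].nunique()

# The dataset satisfies (distinct) l-diversity for l equal to the smallest count
l = distinct_values.min()
print(f"Every equivalence class contains at least {l} distinct diagnoses")   # l = 3 here
```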
Even after anonymization, there is still a residual risk that an individual could be identified, for example by cross-referencing the data with external information or through inferences. For this reason, it is essential to carefully assess the risk before sharing or publishing a dataset.
The risks are typically divided into three categories: prosecutor risk, where the attacker knows that a specific person is in the dataset and tries to find their record; journalist risk, where the attacker does not know whether the target is in the dataset; and marketer risk, where the attacker tries to re-identify as many records as possible rather than a specific individual.
These risks are hierarchical: if a dataset is protected against the highest risk (the prosecutor scenario), it is also considered protected against the other two.
Each organization should define the acceptable level of risk, based on the purposes and context of data processing.
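Under the prosecutor scenario, the re-identification risk of a record is commonly estimated as 1 divided by the size of its equivalence class; a minimal sketch of that calculation with pandas, where the threshold of 1/3 is an arbitrary example of an organizational policy:

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip": ["201XX", "201XX", "201XX", "201XX", "201XX"],
})

quasi_identifiers = ["age", "zip"]

# Size of each equivalence class, attached back to every record
sizes = df.groupby(quasi_identifiers).size().reset_index(name="class_size")
df = df.merge(sizes, on=quasi_identifiers)

# Prosecutor risk per record: 1 / size of its equivalence class
df["prosecutor_risk"] = 1 / df["class_size"]

# Compare the worst case against the acceptable threshold chosen by the organization
max_risk = df["prosecutor_risk"].max()
threshold = 1 / 3
print(f"Maximum prosecutor risk: {max_risk:.2f} (acceptable: {max_risk <= threshold})")
```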
Data anonymization today represents a crucial challenge in balancing two often conflicting needs: on the one hand, protecting individual privacy, and on the other, leveraging data as a resource for analysis, research, and innovation.
It’s crucial to understand that no technique alone guarantees absolute protection: the effectiveness of anonymization depends on the structure of the dataset, the context of use, and the presence of external data that could be cross-referenced to perform re-identification attacks.
In an era dominated by big data and artificial intelligence, the proper management of personal data is an ethical as well as legal obligation. Anonymization, if well-designed and evaluated, can be a powerful tool for enabling innovation while respecting fundamental rights.