Aniello Giugliano : 21 October 2025 08:03
In an era where personal data is being produced and shared on a massive, daily basis, the concept of anonymization takes on a central role in the debate on privacy protection and the ethical reuse of data. With the advent of the General Data Protection Regulation (GDPR), the European regulatory framework introduced precise definitions and stringent obligations for the processing of personal data, clearly distinguishing between identifiable, pseudonymized, and fully anonymized data.
According to the GDPR, data can only be considered anonymous when it is rendered so irreversibly, meaning when it is no longer possible to directly or indirectly identify the data subject, even through the use of additional information or inference techniques. However, achieving an absolute level of anonymization is far from trivial: datasets may contain direct identifiers (such as names or ID numbers) and quasi-identifiers (information such as location, age, or preferences), which, when combined, can allow individuals to be re-identified.
Interest in anonymization has grown in tandem with the exponential increase in the amount of data available online. Today, over half the world’s population is connected to the internet, and many organizations—large and small—analyze data to identify patterns, behaviors, and profiles, both for internal and commercial purposes. This data is often shared with third parties or made public for research purposes, increasing the risk of exposing personal information.
Over the years, there have been numerous cases where inadequate anonymization processes have led to the re-identification of users, with serious consequences for their privacy. A notable case occurred in 2006, when a streaming platform published a dataset containing millions of movie ratings that were declared “anonymous,” but were easily matched to their respective users through cross-referencing. Similarly, in 2014, the New York City Taxi and Limousine Commission released data on 2013 taxi trips, but improper anonymization allowed the original licenses and even the identities of some drivers to be traced.
These examples demonstrate that anonymization is not just a technical issue, but also a regulatory, ethical, and methodological one, and it raises many questions: when can data truly be considered anonymous, which techniques are appropriate for a given dataset, and how can the residual risk of re-identification be assessed?
The aim of this article is to clarify these questions by providing an overview of the main anonymization techniques currently in use, analyzing the risks associated with re-identification, and illustrating how tools and methodologies can support secure data disclosure, compliant with privacy-by-design and data protection principles. In particular, it will explore the differences between the anonymization of relational data and that of structured graph data, which are increasingly widespread in social networks and behavioral analytics.
The General Data Protection Regulation (GDPR) introduces a fundamental distinction between personal data, pseudonymized data, and anonymized data. These concepts are often confused, but have very different regulatory, technical, and operational implications.
Article 4 of the GDPR defines personal data as any information relating to an identified or identifiable natural person, and pseudonymization as the processing of personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information, provided that such information is kept separately and protected.
This distinction is far from a mere formality. According to Recital 26 of the GDPR:
“The principles of data protection should not apply to anonymous information, that is, information which does not relate to an identified or identifiable natural person, or to personal data rendered anonymous in such a way that the data subject is no longer identifiable.”
In other words, once data has been properly anonymized, it no longer falls within the scope of the GDPR. This makes it extremely valuable for processing, analysis, and sharing, especially in areas such as healthcare, statistics, marketing, and scientific research.
One of the most widespread and dangerous misconceptions is that pseudonymization and anonymization are equivalent. In reality, the GDPR is very clear in distinguishing the two concepts: pseudonymized data remains personal data, because the original identity can be recovered by whoever holds the additional information, whereas anonymized data can no longer be linked back to an individual by any means reasonably likely to be used.
Therefore, if there is a possibility, even a remote one, of tracing a person’s identity, the data cannot be considered anonymous, but merely pseudonymized.
The choice of the most suitable anonymization technique depends strictly on the purpose for which the data must be anonymized. Each method involves tradeoffs between the level of guaranteed privacy and the residual usefulness of the data: the more protected the data, the generally lower its granularity and therefore its analytical value.
There are three main ways in which data can be transformed for anonymization purposes: removing information entirely (suppression), altering values so that they no longer correspond to the originals (masking, shuffling, noise addition), and reducing their level of detail (generalization).
The goal, in any case, is to guarantee the privacy of the individuals involved without compromising the usability of the data, especially for statistical analysis, scientific research, or market studies.
This section presents some of the main anonymization techniques, with guidance on their correct use based on context.
Suppression is one of the simplest and most straightforward techniques: it involves removing one or more attributes (or entire records) from a dataset. It is particularly useful when an attribute is not needed for the intended analysis and is highly identifying, or when certain records would otherwise remain identifiable even after other transformations.
Imagine we want to analyze the performance of a group of students on an assessment test. The dataset at our disposal contains three attributes for each participant: the student’s name, the teacher, and the grade obtained.
| Student | Teacher | Grade |
| --- | --- | --- |
| Mirandola L. | Deufemia C. | 28/30 |
| Perillo G. | Deufemia C. | 29/30 |
| Mirandola L. | La Rocca L. | 18/30 |
| Valletta F. | Valtorta R. | 22/30 |
| Perillo G. | Valtorta R. | 24/30 |
Since the analysis is statistical and does not require the identification of individual students, the student name is unnecessary and highly identifying information. To ensure the privacy of data subjects, we use the suppression technique, completely eliminating the column containing the names.
| Teacher | Grade |
| --- | --- |
| Deufemia C. | 28/30 |
| Deufemia C. | 29/30 |
| La Rocca L. | 18/30 |
| Valtorta R. | 22/30 |
| Valtorta R. | 24/30 |
After this operation, the dataset retains its analytical usefulness, as it still allows us to observe and compare test results in relation to different teachers or groups of students, but without exposing personal information.
In some cases, suppression may also involve entire records. This occurs, for example, when the combination of multiple attributes (such as age, geographic location, and test subject) makes a subject potentially identifiable, especially in small samples. If it’s not possible to effectively anonymize those records using other techniques, complete suppression represents the safest measure to protect privacy.
Suppression is a simple and effective technique as it completely eliminates sensitive information, making it irretrievable and thus ensuring a high level of privacy protection. However, this effectiveness comes at a cost: removing attributes or records can compromise the quality and usefulness of the dataset, especially if the deleted information is relevant to the analysis. Furthermore, an unbalanced use of suppression can introduce bias into the results, reducing the reliability of the conclusions drawn.
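As a concrete illustration, here is a minimal sketch of attribute and record suppression with pandas; the toy dataset and the rarity threshold are assumptions made up for this example, not part of any real pipeline.

```python
import pandas as pd

# Toy dataset mirroring the table above
df = pd.DataFrame({
    "student": ["Mirandola L.", "Perillo G.", "Mirandola L.", "Valletta F.", "Perillo G."],
    "teacher": ["Deufemia C.", "Deufemia C.", "La Rocca L.", "Valtorta R.", "Valtorta R."],
    "grade":   ["28/30", "29/30", "18/30", "22/30", "24/30"],
})

# Attribute suppression: remove the directly identifying column
anonymized = df.drop(columns=["student"])

# Record suppression: drop rows whose remaining attributes are too rare
# to hide the subject (here, teachers appearing only once in the sample)
teacher_counts = anonymized["teacher"].value_counts()
rare = teacher_counts[teacher_counts < 2].index
anonymized = anonymized[~anonymized["teacher"].isin(rare)]

print(anonymized)
```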
Character substitution is an anonymization technique that involves partially masking the content of an attribute by replacing certain characters with predefined symbols, such as X or *. This approach hides part of the information while preserving the data’s structure, which can still serve analytical or verification purposes. The technique does not eliminate the attribute; it only obscures the most sensitive portion, making it less identifiable. Substitution can be applied, for example, to postal codes, telephone numbers, email addresses, or any text field that could potentially be linked to a person.
Suppose we want to analyze the geographic distribution of a service’s users using the postal code. If the full code can identify the individual, we can mask the last digits.
Before replacement: e.g., 20156 (a full postal code identifying a specific area)
After replacement: e.g., 201XX (only the broader zone is retained)
This way, it is still possible to conduct an analysis by general geographic area (e.g., neighborhoods or urban areas), but it eliminates the precision that could lead to the exact location and therefore indirect identification of the subject.
Character substitution is easy to implement and maintains good data utility, but it is less secure than other more radical techniques, such as suppression. Indeed, if the surrounding context is too rich in information, or if multiple attributes are cross-referenced, the risk of re-identification may still arise.
For this reason, this technique is especially suitable for large datasets, where the masked attribute alone is not sufficient to identify a person, but can help increase overall protection when combined with other techniques.
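A minimal sketch of this kind of masking in plain Python; the field (a postal code), the number of visible characters, and the masking symbol are illustrative assumptions.

```python
def mask_value(value: str, visible: int = 3, symbol: str = "X") -> str:
    """Keep the first `visible` characters and replace the rest with `symbol`."""
    return value[:visible] + symbol * max(len(value) - visible, 0)

# Full postal codes reduced to their broader geographic area
for code in ["20156", "80121", "00184"]:
    print(code, "->", mask_value(code))   # e.g. 20156 -> 201XX
```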
The shuffling technique involves randomly rearranging the values of a given attribute within the dataset, maintaining the list of values intact but disassociating them from their original records. This technique is useful when you want to preserve the statistical distribution of an attribute, but don’t need to maintain the relationship between that attribute and the others in the dataset. Essentially, the values aren’t altered, but are allowed to circulate between different records, making it more difficult to directly link sensitive information to a specific individual.
Let’s imagine we have a dataset that contains a customer ID, the customer’s region of residence, and the amount spent.
If the goal is to analyze the distribution of amounts spent by geographic area, but without wanting to link the specific amount to an individual customer, we can apply shuffling to the “amount spent” attribute, shuffling its values across different records.
Before shuffling:
| ID | Region | Amount |
| --- | --- | --- |
| 001 | North | 120 |
| 002 | South | 250 |
| 003 | Center | 180 |
After shuffling the amount:
| ID | Region | Amount |
| --- | --- | --- |
| 001 | North | 180 |
| 002 | South | 120 |
| 003 | Center | 250 |
This way, regional data and the aggregate distribution of amounts are preserved, but the direct correlation between individual and economic value is interrupted, reducing the risk of identification.
Although simple to apply, shuffling alone does not guarantee adequate anonymization. In some cases, especially when the datasets are small or the attributes are highly correlated, it may be possible to reconstruct the original associations through inference techniques.
For this reason, shuffling is often used in combination with other techniques, such as suppression or generalization, to strengthen data protection.
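A sketch of shuffling with pandas and NumPy, assuming the column names of the example above; the fixed seed is only there to make the snippet reproducible, whereas a real permutation should not be predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "002", "003"],
    "region": ["North", "South", "Center"],
    "amount": [120, 250, 180],
})

# Randomly permute the 'amount' column while leaving the other columns untouched:
# the overall distribution of amounts is preserved, but the link between a
# specific customer and a specific amount is broken.
rng = np.random.default_rng(seed=42)
df["amount"] = rng.permutation(df["amount"].to_numpy())

print(df)
```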
Adding noise is a widely used anonymization technique that involves slightly modifying data values, introducing artificial variations that hide the true values while retaining statistically useful information. The goal is to reduce the precision of the data to make it less identifiable, but without compromising its overall usefulness, especially when analyzed in aggregate.
Suppose we have a dataset with the birth dates of patients in an epidemiological analysis. To reduce the risk of identification, we can randomly add or subtract a few days or months from each date.
Original date: e.g., 14 March 1985
After adding noise (± a few days): e.g., 11 March 1985
These variations do not significantly alter the analysis, for example by age groups or time trends, but they make it much more difficult to connect a date with certainty to a specific individual.
A critical element of this technique is determining how much noise to add: too little may not be enough to protect privacy, while too much can distort the analysis results. For this reason, it’s essential to carefully evaluate the context of use and, when possible, apply controlled noise addition techniques, such as Differential Privacy, which we’ll discuss later.
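A sketch of noise addition on dates with pandas; the ±15-day window and the sample dates are arbitrary choices for illustration, and in practice the amount of noise must be calibrated to the analysis.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(["1985-03-14", "1990-07-02", "1978-11-23"]))

# Shift every date by a random number of days in [-15, +15]
rng = np.random.default_rng()
offsets = rng.integers(-15, 16, size=len(dates))
perturbed = dates + pd.to_timedelta(offsets, unit="D")

print(pd.DataFrame({"original": dates, "perturbed": perturbed}))
```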
Generalization is another anonymization technique in which data is simplified or aggregated to reduce the level of detail, and thus the possibility of identification. In practice, a specific value is replaced with a more general one, changing the scale or level of precision of the attribute.
In the case of dates, instead of reporting the day, month, and year, we can decide to keep only the year: the full date of birth disappears, and only the year (e.g., 1985) is retained.
Another classic example concerns age: instead of indicating “33 years old”, we can write “30-35” or “30+”, reducing the precision but maintaining the information useful for demographic analysis.
Generalization is particularly useful when you want to preserve the analysis across groups (clusters), but it is less effective for studies that require individual precision. Furthermore, it doesn’t always guarantee a sufficient level of anonymization, especially if the generalized data can be cross-referenced with other sources.
This is why generalization is often combined with other techniques, or applied through more advanced models such as k-anonymity and l-diversity, which we will see in the next sections.
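A sketch of generalization with pandas, binning ages into ten-year bands and reducing full dates to the year; the bin edges, labels, and sample values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [33, 36, 41, 58],
    "birth_date": pd.to_datetime(["1992-05-01", "1989-02-17", "1984-09-30", "1967-12-05"]),
})

# Replace exact ages with ten-year bands
df["age_band"] = pd.cut(
    df["age"], bins=[30, 40, 50, 60], labels=["30-39", "40-49", "50-59"], right=False
)

# Replace full birth dates with the year only
df["birth_year"] = df["birth_date"].dt.year

print(df[["age_band", "birth_year"]])
```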
The basic idea of k-anonymity is to ensure that each record in a dataset is indistinguishable from at least k – 1 other records with respect to a set of attributes considered potentially identifying (called quasi-identifiers).
In other words, a dataset satisfies the k-anonymity criterion if every combination of quasi-identifier values appears in at least k records, making it very difficult to trace the identity of a single person.
Suppose we have a dataset with the following columns: age, ZIP code, and pathology.
If these attributes are considered quasi-identifiers, and we apply k-anonymity with k = 3, then each combination of age and zip code must appear in at least three records.
Before generalization:

| Age | ZIP code | Pathology |
| --- | --- | --- |
| 34 | 20156 | Diabetes |
| 35 | 20156 | Diabetes |
| 36 | 20156 | Diabetes |
After generalization and masking:

| Age | ZIP code | Pathology |
| --- | --- | --- |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
In this example, the age has been generalized and the ZIP code partially masked, creating an indistinguishable group of at least three records. Consequently, the probability of identifying a specific individual in that group is at most 1 in 3.
K-anonymity does not protect against attacks based on background knowledge or on the homogeneity of the sensitive attribute: if an adversary knows additional information (e.g., that a person lives in a certain ZIP code and is of a certain age), they can locate that person’s equivalence group and, if all k records in the group share the same sensitive value, learn their disease anyway. To mitigate this risk, more sophisticated approaches are used, such as l-diversity and t-closeness, which introduce additional constraints on the distribution of sensitive data within groups.
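One way to verify the property is to group the records by their quasi-identifiers and look at the size of the smallest group; a minimal sketch with pandas, reusing the column names of the example above:

```python
import pandas as pd

# Generalized dataset from the example above
df = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39"],
    "zip": ["201XX", "201XX", "201XX"],
    "pathology": ["Diabetes", "Diabetes", "Diabetes"],
})

quasi_identifiers = ["age", "zip"]

# Size of each equivalence class (records sharing the same quasi-identifier values)
class_sizes = df.groupby(quasi_identifiers).size()

# The dataset is k-anonymous for k equal to the size of the smallest class
k = class_sizes.min()
print(f"The dataset satisfies k-anonymity with k = {k}")   # k = 3 here
```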
L-diversity is a technique that extends and strengthens the concept of k-anonymity, with the aim of preventing a lack of variety in the sensitive values within equivalence groups (i.e., groups of records made indistinguishable from each other).
Indeed, even if a dataset is k-anonymous, it can still be vulnerable: if in a group of 3 records all subjects share the same value for a sensitive attribute (e.g., a disease), an attacker could easily deduce that information, even without knowing exactly who it belongs to. With l-diversity, an additional rule is imposed: each equivalence group must contain at least L distinct values for the sensitive attribute. This increases the level of uncertainty for anyone attempting a re-identification.
Let’s take the example of a healthcare dataset with the following attributes: age, ZIP code, and pathology (the diagnosis, which is the sensitive attribute).
Suppose we have obtained indistinguishable groups via k-anonymity, but all subjects have the same diagnosis:
| Age | ZIP code | Pathology |
| --- | --- | --- |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Diabetes |
A group like this respects k-anonymity (k = 3) but is highly vulnerable, because an attacker knows that everyone in that group has diabetes. Applying l-diversity with L = 3, the same group must instead contain at least three distinct diagnoses:
| Age | ZIP code | Pathology |
| --- | --- | --- |
| 30-39 | 201XX | HIV |
| 30-39 | 201XX | Diabetes |
| 30-39 | 201XX | Asthma |
Now, even if the group is indistinguishable from the quasi-identifiers, the sensitive attribute “diagnosis” has at least three different values, which limits the possibility of inferring certain information.
L-diversity is effective in preventing so-called homogeneity attacks, in which every record of an equivalence group shares the same sensitive value, and in increasing the uncertainty of an attacker who has already narrowed a target down to a group.
However, it is not foolproof: in situations where the distribution of sensitive data is highly unbalanced (e.g., 9 common diagnoses and 1 rare one), even with l-diversity, a probabilistic inference attack may occur, where the less frequent information can still be inferred with high probability.
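The corresponding check counts, for each equivalence class, the distinct values of the sensitive attribute; a minimal sketch with pandas, again assuming the column names of the example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39"],
    "zip": ["201XX", "201XX", "201XX"],
    "pathology": ["HIV", "Diabetes", "Asthma"],
})

quasi_identifiers = ["age", "zip"]

# Number of distinct sensitive values within each equivalence class
distinct_values = df.groupby(quasi_identifiers)["pathology"].nunique()

# The dataset satisfies (distinct) l-diversity for l equal to the smallest count
l = distinct_values.min()
print(f"Every equivalence class contains at least {l} distinct diagnoses")   # l = 3 here
```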
Even after anonymization, there is still a residual risk that an individual could be identified, for example by cross-referencing the data with external information or through inferences. For this reason, it is essential to carefully assess the risk before sharing or publishing a dataset.
The risks are typically divided into three categories: prosecutor risk, where the attacker knows that a specific person is in the dataset and tries to find their record; journalist risk, where the attacker does not know whether the target is in the dataset; and marketer risk, where the attacker tries to re-identify as many records as possible rather than a specific individual.
These risks are hierarchical: if a dataset is protected against the highest risk (the prosecutor scenario), it is also considered protected against the other two.
Each organization should define the acceptable level of risk, based on the purposes and context of data processing.
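Under the prosecutor scenario, the re-identification risk of a record is commonly estimated as 1 divided by the size of its equivalence class; a minimal sketch of that calculation with pandas, where the threshold of 1/3 is an arbitrary example of an organizational policy:

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip": ["201XX", "201XX", "201XX", "201XX", "201XX"],
})

quasi_identifiers = ["age", "zip"]

# Size of each equivalence class, attached back to every record
sizes = df.groupby(quasi_identifiers).size().reset_index(name="class_size")
df = df.merge(sizes, on=quasi_identifiers)

# Prosecutor risk per record: 1 / size of its equivalence class
df["prosecutor_risk"] = 1 / df["class_size"]

# Compare the worst case against the acceptable threshold chosen by the organization
max_risk = df["prosecutor_risk"].max()
threshold = 1 / 3
print(f"Maximum prosecutor risk: {max_risk:.2f} (acceptable: {max_risk <= threshold})")
```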
Data anonymization today represents a crucial challenge in balancing two often conflicting needs: on the one hand, protecting individual privacy, and on the other, leveraging data as a resource for analysis, research, and innovation.
It’s crucial to understand that no technique alone guarantees absolute protection: the effectiveness of anonymization depends on the structure of the dataset, the context of use, and the presence of external data that could be cross-referenced to perform re-identification attacks.
In an era dominated by big data and artificial intelligence, the proper management of personal data is an ethical as well as legal obligation. Anonymization, if well-designed and evaluated, can be a powerful tool for enabling innovation while respecting fundamental rights.