
Filippo Boni: 29 October 2025, 06:58
In an age where every question is answered with a simple tap, we users have perhaps gotten a little too comfortable with the new AI-based assistants. Ultimately, it makes little difference which one we choose: the most popular language models all belong to large private companies. Nothing new, some might say; so do most of the digital services we use every day.
The difference, however, is that here we’re not interacting with a search engine or a social network, but with a system that simulates human conversation. And it’s precisely this naturalness that drives us, often without realizing it, to share information we would never voluntarily disclose elsewhere.
At least not directly: we could spend days discussing how these companies indirectly collect, correlate, and analyze our data to build true digital twins (extremely accurate digital models of ourselves). The point is that every interaction, even the seemingly innocuous one, contributes to enriching that invisible profile that describes who we are, what we do, and even how we think.
Not all data we share online has the same weight or value. Some, if disclosed or processed improperly, can expose a person or organization to significant risks: identity theft, trade secret violations, blackmail, or reputational damage. For this reason, regulations, starting with the European General Data Protection Regulation (GDPR), distinguish between ordinary personal data and sensitive or special categories of data.
Personal data is any information that directly or indirectly identifies a natural person. This category includes names, addresses, telephone numbers, email addresses, tax information, as well as technical data such as IP addresses or cookies, if they can be traced back to an individual.
Sensitive data (or special categories of personal data, art. 9 GDPR) includes information that reveals more intimate or potentially discriminatory aspects: racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic and biometric data, health data, and information concerning a person’s sex life or sexual orientation.
In the context of business and cybersecurity, this also includes sensitive or confidential data: trade secrets, internal projects, security strategies, login credentials, customer databases, or network logs. These data aren’t always “personal,” but their exposure can compromise the security of systems or individuals.
Distinguishing between personal and sensitive data isn’t enough: what matters is the context in which it’s shared. Innocuous information on a social network can become risky if inserted into a prompt for a language model that stores or analyzes interactions. The sensitivity of data, therefore, lies not only in its nature, but also in how and where it’s processed.
It happens more often than you think. You find yourself in front of an LLM chat window and write: “I’m copying the draft contract with our provider for you, so you can help me rewrite it more clearly.”
A gesture that seems harmless, almost practical. You’d do it with a colleague, why not with AI? Yet, that simple copy-and-paste contains confidential clauses, names of business partners, financial terms, and references to projects that, in any other context, you would never share publicly.
This is where the unintended persuasiveness of language models comes into play: their ability to mimic human speech, respond politely and cooperatively, creates a climate of trust that lowers defenses. We fail to realize that, while we ask for style advice or a review, we are handing over to a private system data that would otherwise fall squarely within the category of confidential corporate information.
Traditional social engineering is based on the art of manipulating people to gain information, access, or actions they wouldn’t normally grant. It’s a form of attack that exploits a user’s trust, curiosity, or haste rather than a system’s technical vulnerabilities.
With large language models (LLMs), this technique takes on a new and more subtle form: it’s not the human attacker that persuades, but the model’s interface itself. The AI doesn’t intend to deceive, but its polite and reassuring manner of communicating induces a sense of trust that reduces attention and lowers cognitive defenses.
The user thus ends up behaving as if they were speaking to an expert consultant or a trusted colleague. In this context, providing details about internal procedures, contracts, projects, or even personal issues becomes a natural, almost spontaneous gesture. It’s the digital transposition of social engineering, but without intention: a form of involuntary persuasion born of feigned empathy.
The risk isn’t so much that the LLM “wants” to steal information, but that its capacity for natural interaction blurs the line between private conversation and sharing sensitive data. And it’s precisely in this gray area, between comfortable communication and automatic trust, that new data security risks lurk.
Even when we believe we haven’t provided sensitive data, we often send excessive fragments of information: questions from previous conversations, partial files, or seemingly innocuous details that, when combined, reveal much more. The model, by its very nature designed to build context and continuity in conversations, ends up implicitly performing data-gathering and intelligence activities about the user. If we imagined connecting all the information (direct or subtle) provided over time, we could reconstruct extremely detailed profiles of our lives, habits, and problems.
Assuming there are no effective regulations or that the service provider fails to comply with them, the plausible consequences boil down to two critical scenarios. First, on a personal level, the result is the creation of a digital twin, a digital model that “thinks” like us and, thanks to predictive analytics, could anticipate purchasing behaviors even before we’re aware of them. This would result in hyper-personalized advertising campaigns and, in the extreme, automatic purchasing or recommendation mechanisms that operate without full human oversight.
Second, on an organizational level, if a company has thousands of employees sharing sensitive information with an external service, the attack surface grows exponentially. A vulnerability in the LLM service or a compromise of its infrastructure would lead to a massive loss of corporate intelligence: for an attacker, it would essentially be a reconnaissance operation already performed by the users themselves, with potentially devastating consequences.
The European Union has long been at the forefront of data protection and the ethical use of artificial intelligence. The General Data Protection Regulation (GDPR) represented a global turning point, imposing principles such as transparency, data minimization, and informed consent. With the recent AI Act, Europe has extended this vision to the entire AI ecosystem, including large-scale language models (LLMs).
The GDPR established clear rules: data must be collected only for specific purposes, retained for the time strictly necessary, and processed with the user’s consent. The AI Act adds an additional layer of protection, introducing documentation, risk assessment, and traceability requirements for AI systems.
In theory, these rules should ensure that AI-based service providers more clearly disclose how and for what purposes they use user information. The goal is to create a transparent digital ecosystem where innovation does not come at the expense of privacy.
In practice, however, significant limitations emerge. LLMs are extremely complex and often opaque technologies: even when providers publish detailed information, it is very difficult to verify whether data is actually being processed compliantly.
Another issue is jurisdiction: many major operators do not have headquarters or servers in Europe, making it difficult for the relevant authorities to carry out effective checks or impose penalties.
Added to this is the economic aspect: European regulations, while guaranteeing protection, impose costs and obligations that only large players can afford. European startups and small businesses thus risk being left behind, crushed between bureaucracy and global competition.
Even with strict regulations like the GDPR, reality shows that compliance is never a given. In recent years, several large digital companies have been fined billions of euros due to unclear practices in the handling of personal data or the use of user profiles for commercial purposes. In some cases, individual fines have exceeded hundreds of millions of euros, a clear sign that these violations are not marginal incidents.
These numbers reveal a lot: the regulations exist, but they aren’t always respected, and controls, while rigorous, aren’t enough to ensure effective data protection. The technical complexity of AI systems and the non-European location of many providers make it difficult to verify what’s really happening “behind the scenes” of data processing.
For this reason, security can’t be entrusted solely to laws or regulators, but must start with the user themselves. Every time we interact with a language model, even innocently, we’re potentially contributing to a massive collection of information. And while precise rules exist, there’s no guarantee they’ll always be respected.
If technology evolves faster than regulations, the only real defense becomes awareness. You don’t need to be a cybersecurity expert to protect your data: you need, first and foremost, to understand what you’re sharing, with whom, and in what context.
Language models are powerful tools, but they’re not neutral. Every typed word, question, file attachment, or text to be reviewed can be transformed into a fragment of information that enriches enormous training or analysis databases.
The first rule is the simplest, but also the most overlooked: avoid sharing information you would never share with a stranger. Contract texts, client names, details of internal procedures, or personal information should never appear in a chat with an LLM, no matter how safe it may seem.
A good approach is to ask yourself: “If this text accidentally ended up on the internet, would it be a problem?” If the answer is yes, it should not be shared.
When AI needs to be used for work, real data can be replaced with generic examples or synthetic versions. This is the logic of data minimization: provide only what the model really needs to answer, nothing more.
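To make the idea concrete, here is a minimal sketch in Python of what such minimization could look like in practice: a few regular expressions and a list of names to strip before a prompt ever leaves the machine. The patterns, placeholder labels, and example names are illustrative assumptions, not a production-grade anonymizer; real deployments would typically lean on dedicated data-loss-prevention tooling.

```python
import re

# Illustrative patterns only: a real scrubber would need far more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s./-]{7,}\d\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

# Hypothetical names you never want to send out (clients, partners, projects).
KNOWN_NAMES = ["Acme S.p.A.", "Project Aurora"]

def minimise(text: str) -> str:
    """Replace identifiable details with generic placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for i, name in enumerate(KNOWN_NAMES, start=1):
        text = text.replace(name, f"[PARTY_{i}]")
    return text

draft = (
    "Contract between Acme S.p.A. and us for Project Aurora. "
    "Contact: mario.rossi@example.com, +39 333 1234567."
)
print(minimise(draft))
# -> "Contract between [PARTY_1] and us for [PARTY_2]. Contact: [EMAIL], [PHONE]."
```

The model still gets enough structure to rewrite the clause or review the wording, but the identities, contacts, and project names stay inside the organization.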
Many providers offer enterprise or on-premise versions of their models, with clauses that exclude the use of data for training. Using these solutions, whenever possible, dramatically reduces the risk of data loss.
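As an illustration of that approach, the sketch below assumes a model served inside the corporate network through an OpenAI-compatible endpoint, as exposed for example by self-hosted servers such as vLLM or Ollama. The URL, key, and model name are placeholders; the point is simply that prompts stay on infrastructure the organization controls and no training clause applies.

```python
from openai import OpenAI

# Hypothetical configuration: endpoint URL, API key, and model name are
# placeholders for whatever the internal, self-hosted gateway actually exposes.
client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # on-premise gateway, not a public cloud API
    api_key="not-needed-locally",                    # often a dummy value for local deployments
)

response = client.chat.completions.create(
    model="local-model",  # the model served by the internal endpoint
    messages=[
        {"role": "user", "content": "Rewrite this clause more clearly: ..."}
    ],
)
print(response.choices[0].message.content)
```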
At the corporate level, defense also involves training. Explaining to employees what can and cannot be shared with an LLM is crucial. A single incorrect interaction can compromise sensitive data from an entire department or project.
Conversational artificial intelligence represents one of the revolutions of our era. It simplifies our lives, accelerates processes, and increases productivity. But like any technology that insinuates itself into language and thought, it carries a subtle risk: that of making us forget that every word we type is, ultimately, data. Data that says something about us, our work, our habits, or our company.
The rules exist, and in Europe they are among the most advanced in the world, but they alone are not enough to protect us. The multimillion-euro fines imposed on several digital companies demonstrate that even those who should guarantee security and transparency don’t always do so. The rules define the limits, but it is user awareness that determines whether those limits are actually respected.
The real defense, therefore, isn’t just regulatory but cultural. It means learning to communicate with AI with the same caution you would use to protect a private conversation or a confidential company document.
Whenever a language model listens to us, analyzes, reformulates, or suggests something, we must remember that we are not speaking to a friend, but to a system that observes, processes, and stores.
Filippo Boni