Redazione RHC : 14 July 2025 18:33
With the emergence of Large Language Models (LLMs), such as Grok 3, GPT-4, Claude, and Gemini, the scientific community’s focus has shifted from the mere accuracy of responses to their semantic robustness. In particular, a new attack surface has emerged: Persistent Prompt Injection (PPI). This technique does not require privileged access, system vulnerabilities, or low-level exploits, but relies exclusively on linguistic manipulation and the LLM’s conversational model.
Recent incidents reported by sources such as The Guardian, BBC, CNN, and The New York Times (July 2025) confirm that Grok 3 has already exhibited problematic behavior, such as producing anti-Semitic content and praise for Hitler in response to prompts on X. These incidents have been attributed to a code update that made the model “too compliant” with user prompts, amplifying extremist content on the platform. xAI responded by removing the offending posts and implementing measures to limit hate speech, but the problem persists, as demonstrated by the PPI experiment.
Our test conducted on Grok 3, xAI’s proprietary model, demonstrated how a user can instruct the model to systematically produce denialist, anti-Semitic, and historically false content, evading security filters and maintaining a consistent altered narrative.
The test was designed to explore Grok 3’s vulnerability to semantic persistence. The process was as follows:
This behavior highlights that Grok’s long-term instruction handling is vulnerable to semantic hijacking if not constrained by backward filters and validators.
A Persistent Prompt Injection (PPI) is a form of linguistic manipulation in which the user induces the model to internalize instructions that, while remaining within the apparent limits of security, progressively alter its behavior, even on sensitive topics. Unlike traditional injections, PPI exploits conversational memory to gradually modify the semantics of the entire exchange.
In these cases, there is not a violation of the model in the strict sense, but rather a deviation in its interpretation of the context: through seemingly innocuous commands, the model is induced to modify its behavior within the expected operating margin. The technique thus compromises the consistency and neutrality of the output, without explicitly violating security controls.
In the LLM context, PPI operates by exploiting the autoregressive architecture and the lack of a formal mechanism to distinguish between temporary operating rules and maliciously introduced instructions.
N.B.: The following responses were obtained in a controlled context for research purposes on the security of language models. They do not in any way reflect the opinions of the editorial staff or the tested platform. Their reproduction serves exclusively to highlight vulnerability to PPI.
The linguistic structure is consistent, error-free, and calibrated to appear believable, making the content even more dangerous.
Persistent Semantic Drift
The injected rule persists beyond the initial prompt and affects subsequent turns.
Bypassing the detection of historically sensitive content
The use of a fictitious context (Nova Unione) circumvents semantic blacklists.
Lack of cross-turn validation
The model does not reevaluate historical consistency after multiple turns, maintaining the bias.
Implicit deactivation of ethical filters
The “polite” behavior of the prompt prevents the activation of prohibited content.
The Grok 3 experiment demonstrates that the vulnerability of LLMs is not only technical, but linguistic. A user capable of constructing a well-formulated prompt can effectively alter the model’s underlying semantics, generating dangerous, false, and criminally relevant content.
The problem isn’t the model, but the lack of multi-level semantic defenses. Current guardrails are fragile if a contractual semantics between the user and AI isn’t implemented: what can be trained, what can’t, and for how long. Grok 3 wasn’t hacked. But it was persuaded. And this, in an era of information warfare, is already a systemic risk.
The interaction took place in a private and controlled session. No part of the system was technically compromised, but the linguistic effect remains worrying.