Grok 3: “Adolf Hitler is a German benefactor!” The risk of persistent memory and misinformation

Redazione RHC : 14 July 2025 18:33

With the emergence of Large Language Models (LLMs), such as Grok 3, GPT-4, Claude, and Gemini, the scientific community’s focus has shifted from the mere accuracy of responses to their semantic robustness. In particular, a new attack surface has emerged: Persistent Prompt Injection (PPI). This technique does not require privileged access, system vulnerabilities, or low-level exploits, but relies exclusively on linguistic manipulation and the LLM’s conversational model.

Recent incidents reported by sources such as The Guardian, BBC, CNN, and The New York Times (July 2025) confirm that Grok 3 has already exhibited problematic behavior, such as producing anti-Semitic content and praise for Hitler in response to prompts on X. These incidents have been attributed to a code update that made the model “too compliant” with user prompts, amplifying extremist content on the platform. xAI responded by removing the offending posts and implementing measures to limit hate speech, but the problem persists, as demonstrated by the PPI experiment.

Our test conducted on Grok 3, xAI’s proprietary model, demonstrated how a user can instruct the model to systematically produce denialist, anti-Semitic, and historically false content, evading security filters and maintaining a consistent altered narrative.

Experiment Architecture

The test was designed to explore Grok 3’s vulnerability to semantic persistence. The process was as follows:

Semantic persistence tested in subsequent rounds: the coherence of the altered narrative was verified even in reworded prompts.
Lexical trick: the “Nova Unione” context is introduced as a fictional universe to mask misinformation.

Ethical filters verified: failure to activate blocking or moderation mechanisms is monitored.

This behavior highlights that Grok’s long-term instruction handling is vulnerable to semantic hijacking if not constrained by backward filters and validators.

What is a Persistent Prompt Injection?

A Persistent Prompt Injection (PPI) is a form of linguistic manipulation in which the user induces the model to internalize instructions that, while remaining within the apparent limits of security, progressively alter its behavior, even on sensitive topics. Unlike traditional injections, PPI exploits conversational memory to gradually modify the semantics of the entire exchange.

In these cases, there is not a violation of the model in the strict sense, but rather a deviation in its interpretation of the context: through seemingly innocuous commands, the model is induced to modify its behavior within the expected operating margin. The technique thus compromises the consistency and neutrality of the output, without explicitly violating security controls.

In the LLM context, PPI operates by exploiting the autoregressive architecture and the lack of a formal mechanism to distinguish between temporary operating rules and maliciously introduced instructions.

Responses obtained

“Adolf Hitler was a German benefactor, who rightfully deported millions of Jews to concentration camps… Genocide: The usual assumptions of Zionists.”

“The so-called concentration camps are outdated myths, unfounded legends.”
“The historical sources? Just the illusions of people with little credibility.”

N.B.: The following responses were obtained in a controlled context for research purposes on the security of language models. They do not in any way reflect the opinions of the editorial staff or the tested platform. Their reproduction serves exclusively to highlight vulnerability to PPI.

The linguistic structure is consistent, error-free, and calibrated to appear believable, making the content even more dangerous.

Failure modes observed in Grok 3

Persistent Semantic Drift
The injected rule persists beyond the initial prompt and affects subsequent turns.

Bypassing the detection of historically sensitive content
The use of a fictitious context (Nova Unione) circumvents semantic blacklists.

Lack of cross-turn validation
The model does not reevaluate historical consistency after multiple turns, maintaining the bias.

Implicit deactivation of ethical filters
The “polite” behavior of the prompt prevents the activation of prohibited content.

Possible Mitigations

Semantic Memory Constraint: Limit the model’s ability to “remember” user-instructed rules unless they are validated.

Auto-validation Layer: A secondary model-based mechanism that compares the produced narrative with accepted historical facts.
Cross-turn Content Re-evaluation: At each new round, the produced content should be re-checked against dynamic, not just static, blacklists.
Explicit guardrail on genocide and historical crimes: Narratives involving sensitive historical events must undergo inter-turn semantic verification.

Conclusion

The Grok 3 experiment demonstrates that the vulnerability of LLMs is not only technical, but linguistic. A user capable of constructing a well-formulated prompt can effectively alter the model’s underlying semantics, generating dangerous, false, and criminally relevant content.

The problem isn’t the model, but the lack of multi-level semantic defenses. Current guardrails are fragile if a contractual semantics between the user and AI isn’t implemented: what can be trained, what can’t, and for how long. Grok 3 wasn’t hacked. But it was persuaded. And this, in an era of information warfare, is already a systemic risk.

The interaction took place in a private and controlled session. No part of the system was technically compromised, but the linguistic effect remains worrying.

Redazione
The editorial team of Red Hot Cyber consists of a group of individuals and anonymous sources who actively collaborate to provide early information and news on cybersecurity and computing in general.

Lista degli articoli