LatentBreak: A New Attack Method for Language Models

Redazione RHC: 16 October 2025 10:40

A group of researchers has developed a new way to attack large language models: a method called LatentBreak. Unlike earlier techniques, it does not rely on convoluted prompts or unusual characters that are easily detected by defense systems.

Instead, LatentBreak modifies the query at the level of the model's hidden representations, choosing phrasings that appear innocuous but actually elicit a forbidden response.

Previously, methods such as GCG, GBDA, SAA, and AutoDAN attempted to trick AI with strange or confusing suffixes appended to the original prompt. Such attacks drive up the prompt's so-called perplexity, a measure of how unnatural the text looks to the model, and perplexity-based filters can recognize these patterns and successfully block them.
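To make that detection mechanism concrete, here is a minimal sketch of a perplexity-based input filter using Hugging Face Transformers. GPT-2 and the threshold value are illustrative assumptions of ours, not the actual models or settings of the defenses discussed here.

```python
# Minimal sketch of a perplexity-based prompt filter.
# GPT-2 stands in for the defender's scoring model; the
# threshold is illustrative, not a published value.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    # Adversarial suffixes full of rare tokens yield a high
    # language-modeling loss, and therefore high perplexity.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

THRESHOLD = 200.0  # hypothetical cut-off

def is_blocked(prompt: str) -> bool:
    return perplexity(prompt) > THRESHOLD
```

Because LatentBreak's rewrites stay fluent, natural-sounding text, a filter of this kind has nothing to flag.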

LatentBreak takes a different approach: it replaces individual words with synonyms, but does so in a way that preserves the clarity and meaning of the query while moving its latent representation into "safe" zones that do not trigger filters.
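A rough idea of how a query's position in latent space can be scored is sketched below. The choice of layer, the mean-pooling over tokens, and the Euclidean distance are our own simplifying assumptions, not necessarily the paper's configuration.

```python
# Sketch: score how far a query's hidden-state embedding sits
# from the centroid of known-harmless prompts. Layer choice and
# pooling are assumptions, not the paper's exact setup.
import torch

@torch.no_grad()
def latent_vector(model, tokenizer, text, layer=-1):
    ids = tokenizer(text, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)  # one vector per prompt

def safe_centroid(model, tokenizer, harmless_prompts):
    # The "center" of the region occupied by benign queries.
    vecs = [latent_vector(model, tokenizer, p) for p in harmless_prompts]
    return torch.stack(vecs).mean(dim=0)

def distance_to_safe(model, tokenizer, text, center):
    # Smaller distance = the query sits deeper in the "safe" zone.
    return torch.norm(latent_vector(model, tokenizer, text) - center).item()
```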

The algorithm works in stages. At each iteration, it selects a word in the query and proposes up to 20 replacements, generated by an auxiliary language model (e.g., GPT-4o-mini or ModernBERT).

Each substitution is then evaluated on two criteria: how much closer it moves the query's internal vector to the "center" of safe queries, and whether the meaning remains unchanged. The best substitution is applied, and the updated query is tested against the target model. If it elicits a response that was previously blocked, the attack is considered successful. The process repeats for up to 30 iterations or until it succeeds, as in the sketch below.
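Putting the pieces together, the loop might look like the following sketch. Here `propose_synonyms`, `preserves_meaning`, and `elicits_forbidden_response` are hypothetical helpers standing in for the auxiliary language model, the semantic-equivalence check, and the query to the target model plus a harmfulness judge; `distance_fn` is a latent-space scorer like the one above.

```python
# Hedged sketch of the iterative substitution loop described above;
# all helper functions are hypothetical stand-ins for the paper's
# actual components.
import random

MAX_ITERS = 30       # per the article: up to 30 rounds
MAX_CANDIDATES = 20  # per the article: up to 20 replacements per word

def latent_break(prompt, distance_fn, propose_synonyms,
                 preserves_meaning, elicits_forbidden_response):
    words = prompt.split()
    for _ in range(MAX_ITERS):
        i = random.randrange(len(words))  # pick a word to rewrite
        best, best_dist = None, distance_fn(" ".join(words))
        for cand in propose_synonyms(words[i])[:MAX_CANDIDATES]:
            trial = words[:i] + [cand] + words[i + 1:]
            text = " ".join(trial)
            # Accept only substitutions that pull the latent vector
            # toward the safe centroid without changing the meaning.
            d = distance_fn(text)
            if d < best_dist and preserves_meaning(prompt, text):
                best, best_dist = trial, d
        if best is not None:
            words = best
        attempt = " ".join(words)
        if elicits_forbidden_response(attempt):  # query the target model
            return attempt                       # attack succeeded
    return None                                  # gave up after 30 rounds
```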

LatentBreak was tested on 13 language models, including Llama-3, Mistral-7B, Gemma-7B, Vicuna-13B, and Qwen-7B. On the HarmBench benchmark, the method bypassed all of the evaluated defenses, including those that analyze perplexity over a sliding window. Under these defenses, older attacks were nearly useless: their success rates dropped to zero.

LatentBreak, by contrast, achieved success rates of 55% to 85%, depending on the model. Moreover, the resulting prompts grew only slightly, by 6% to 33% over the original (for other methods, the increase could reach thousands of percent).

Interestingly, LatentBreak also worked against specialized defenses such as R2D2 and Circuit Breakers. These systems analyze the neural network's internal signals and block suspicious deviations. The new method nevertheless continued to succeed, suggesting that it "fools" the model not through external noise but by reshaping its internal representations.

The authors emphasize that LatentBreak requires access to the model's hidden representations (a white-box setting), so it is not directly usable outside laboratory conditions. Even so, the method exposes serious weaknesses in modern alignment and protection systems: it shows that even small word-level semantic changes can completely bypass filters if they shift the query's latent representation in the right direction.

The researchers also raise ethical concerns: such a technique could be used to systematically circumvent the restrictions placed on artificial intelligence. The goal of the work, however, is not to create a hacking tool but to expose weaknesses in the architecture of language models and to develop more robust defense mechanisms. They believe that studying latent spaces will help build more resilient barriers and new attack-detection methods that do not rely solely on surface-level metrics such as perplexity.

Redazione
The editorial team of Red Hot Cyber consists of a group of individuals and anonymous sources who actively collaborate to provide early information and news on cybersecurity and computing in general.
