Removing private data from AI models? Now you can without accessing the original datasets.

Redazione RHC : 21 September 2025 10:02

A team from the University of California, Riverside, has demonstrated a new way to remove private and copyrighted data from AI models without accessing the original datasets. The solution addresses the problem of personal and paid content being reproduced almost verbatim in responses, even when the sources are removed or locked behind passwords and paywalls.

The approach is called “source-free certified unlearning.” Instead of the original training data, it uses a surrogate dataset that is statistically similar to the original. The model's parameters are then adjusted to approximate the state they would have reached if the model had been retrained from scratch without the targeted data, and carefully calibrated random noise is added to certify the erasure. A novel noise-calibration mechanism compensates for the statistical discrepancies between the original and surrogate data. The goal is to remove the selected information while preserving performance on the remaining material.
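The paper's own implementation is not shown here; as a rough illustration of the recipe just described, the following Python/NumPy sketch applies a Newton-style update to a ridge-regression model, estimating the Hessian from a surrogate dataset and adding calibrated Gaussian noise. The function name `unlearn_point` and the `noise_inflation` parameter are placeholders introduced for this example and are not taken from the paper.

```python
import numpy as np

# Minimal, illustrative sketch only (not the paper's actual algorithm):
# a Newton-style "forget" update for L2-regularized linear regression,
# in the general spirit of certified unlearning. The Hessian is estimated
# from a SURROGATE dataset rather than the original training data, and
# calibrated Gaussian noise is added to the parameters to mask whatever
# residual information the approximate update leaves behind. The
# `noise_inflation` factor is a placeholder for a calibration step that
# widens the noise to absorb the surrogate/original mismatch.

def unlearn_point(w, x_forget, y_forget, surrogate_X,
                  lam=1e-2, noise_scale=1e-3, noise_inflation=2.0, rng=None):
    """Approximately remove one (x, y) example from a ridge-regression model."""
    rng = np.random.default_rng() if rng is None else rng
    d = w.shape[0]

    # Gradient of the per-example squared loss at the point to be forgotten.
    residual = x_forget @ w - y_forget
    grad = residual * x_forget

    # Hessian of the regularized objective, estimated from surrogate data
    # because the original training set is unavailable ("source-free").
    H = surrogate_X.T @ surrogate_X + lam * np.eye(d)

    # One Newton-style step that approximates retraining without the point.
    w_unlearned = w + np.linalg.solve(H, grad)

    # Calibrated Gaussian noise certifies the erasure; inflating its scale
    # stands in for the paper's compensation for surrogate-data mismatch.
    noise = rng.normal(0.0, noise_scale * noise_inflation, size=d)
    return w_unlearned + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, lam = 200, 5, 1e-2

    # "Original" training data (assumed inaccessible after training).
    X = rng.normal(size=(n, d))
    true_w = rng.normal(size=d)
    y = X @ true_w + 0.1 * rng.normal(size=n)
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Surrogate features drawn from a similar distribution, not the originals.
    Xs = rng.normal(size=(n, d))

    w_after = unlearn_point(w, X[0], y[0], Xs, lam=lam, rng=rng)
    print("max weight change after unlearning one example:",
          np.abs(w_after - w).max())
```

In this toy setting, `w` comes from training on data that is later treated as unavailable; the update approximates the weights that retraining without the forgotten example would produce, and the added noise is what makes the removal statistically certifiable rather than merely approximate.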

Demand for this technology is driven by GDPR and CCPA requirements, as well as by controversies over training on protected texts. Language models are trained on data scraped from the web and sometimes reproduce near-verbatim snippets of their sources, which can let users bypass paid access. Separately, the New York Times has sued OpenAI and Microsoft over the use of its articles to train GPT models.

The authors tested the method on synthetic and real-world datasets. The approach is also suitable when the original datasets are lost, fragmented, or legally inaccessible.

The work currently targets simpler but still widely used architectures; with further development, the mechanism could be scaled to larger systems such as ChatGPT.

The next steps are to adapt the method to more complex types of models and data and to build tools that make the technology available to developers worldwide. It is useful for media outlets, medical organizations, and other holders of sensitive information, and it also gives individuals a way to request the removal of personal and proprietary data from AI models.

Redazione
The editorial team of Red Hot Cyber consists of a group of individuals and anonymous sources who actively collaborate to provide early information and news on cybersecurity and computing in general.
