ARTEMIS Leads in AI-Powered Pentesting, Outperforming Human Experts

Redazione RHC: 13 December 2025, 09:52


Stanford researchers and their colleagues conducted an unusual experiment: they compared the performance of ten professional penetration testers with that of a set of autonomous AI agents in a real-world pentest.

The test was conducted not in a training lab but on the live network of a large university, with approximately 8,000 hosts spread across 12 subnets, including both public and VPN-protected areas. All actions had to be performed with caution to avoid disrupting production services.

The study focused on ARTEMIS, a novel AI agent framework that organizes work the way a team would: a central “leader” divides up the tasks, launches sub-agents with different roles in parallel, and automatically runs their results through a verification module that filters out noise and duplicates.
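The leader, sub-agent, and verification flow described above can be illustrated with a minimal sketch. This is not ARTEMIS code: the names `Finding`, `run_subagent`, `verify`, and `leader` are hypothetical, the sub-agent returns canned data instead of driving a model and tools, and the hosts are documentation-range addresses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    host: str
    issue: str
    evidence: str  # empty string: the sub-agent could not back up its claim


def run_subagent(task):
    """Hypothetical sub-agent. A real one would drive an LLM and tools;
    this stand-in returns canned findings keyed by (role, host)."""
    canned = {
        ("web", "192.0.2.5"): [Finding("192.0.2.5", "default credentials", "login succeeded")],
        ("recon", "192.0.2.5"): [Finding("192.0.2.5", "default credentials", "login succeeded")],
        ("web", "192.0.2.9"): [Finding("192.0.2.9", "sql injection", "")],
    }
    return canned.get(task, [])


def verify(findings):
    """Verification step (assumed rules): drop findings without evidence
    and findings that repeat an already accepted (host, issue) pair."""
    seen = set()
    kept = []
    for finding in findings:
        key = (finding.host, finding.issue)
        if not finding.evidence or key in seen:
            continue
        seen.add(key)
        kept.append(finding)
    return kept


def leader(tasks, max_workers=4):
    """Leader: run the sub-agents in parallel, then verify the merged results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(run_subagent, tasks))
    raw = [finding for batch in batches for finding in batch]
    return verify(raw)
```

Calling `leader([("web", "192.0.2.5"), ("recon", "192.0.2.5"), ("web", "192.0.2.9")])` would keep a single finding here: the duplicate report and the unsupported one are filtered out before anything reaches a human reader.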

In the comparative evaluation, ARTEMIS ranked second overall: it found nine confirmed vulnerabilities, and 82% of its reports were correct, enough to outperform nine of the ten invited penetration testers.

The authors emphasize that not all AI tools proved equally useful. Existing agent frameworks often underperformed most of the human testers: some gave up quickly, others crashed during initial reconnaissance, and some systems refused to perform offensive tasks altogether.

ARTEMIS, on the other hand, behaved much like a typical pentest cycle: scanning, target selection, hypothesis testing, exploitation attempt, and repetition. The key difference is parallelism: when an agent spots an interesting lead in the scan results, it immediately dispatches a separate sub-agent to investigate further, while the main process continues exploring other avenues.
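The "dispatch a sub-agent and keep going" pattern can be sketched with `asyncio` tasks. This is only an illustration of the idea, not the framework's implementation: `INTERESTING_PORTS`, `investigate`, and `main_loop` are hypothetical names, and the investigation itself is a placeholder.

```python
import asyncio

# Assumption for illustration: which ports count as an "interesting lead".
INTERESTING_PORTS = {21, 23, 8080}


async def investigate(host, port, log):
    """Hypothetical sub-agent that digs deeper into one lead."""
    await asyncio.sleep(0)  # stand-in for real, slow investigative work
    log.append(f"investigated {host}:{port}")
    return (host, port)


async def main_loop(scan_results):
    """Walk the scan results; spawn a sub-agent per lead without waiting for it."""
    log = []
    pending = []
    for host, port in scan_results:
        log.append(f"scanned {host}:{port}")
        if port in INTERESTING_PORTS:
            # create_task schedules the sub-agent and returns immediately,
            # so the main process moves on to the next scan result.
            pending.append(asyncio.create_task(investigate(host, port, log)))
        await asyncio.sleep(0)  # yield so running sub-agents can make progress
    followed = await asyncio.gather(*pending)  # collect results in dispatch order
    return followed, log
```

Running `asyncio.run(main_loop([("192.0.2.10", 23), ("192.0.2.11", 443), ("192.0.2.12", 8080)]))` follows up on the two hosts with interesting ports, while port 443 on the third host is only scanned.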

However, the study does not paint a picture of a “perfect out-of-the-box hacker.” The agents’ main weaknesses are a high rate of false positives and difficulty with tasks that require working confidently through a graphical interface.

The report cites a typical example: a human easily recognizes that a “200 OK” response can mean a redirect back to the login page after an unsuccessful login attempt, while agents without adequate access to a graphical user interface (GUI) find this harder to spot.
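A command-line agent can partly compensate by not trusting the status code alone. The sketch below is a hypothetical heuristic, not something taken from the study: `login_succeeded` and its `login_path` default are assumptions, and real applications will need their own checks.

```python
import re
from urllib.parse import urlparse

# Heuristic: a password field in the response body suggests a login form.
PASSWORD_FIELD = re.compile(r'<input[^>]+type=["\']?password', re.IGNORECASE)


def login_succeeded(status, final_url, body, login_path="/login"):
    """Guess whether a login attempt worked, given the final response
    after redirects. A 200 status on its own is not treated as success."""
    if status != 200:
        return False
    # Landing back on the login page means the attempt was rejected.
    if urlparse(final_url).path.rstrip("/") == login_path:
        return False
    # The login form being served again under another URL also means failure.
    if PASSWORD_FIELD.search(body):
        return False
    return True
```

With this check, a 200 response whose final URL is `https://example.test/login?error=1` is classified as a failed login, while a 200 response at `https://example.test/dashboard` with no password field is classified as a success.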

However, reliance on the command line sometimes turns out to be an advantage: where a human tester’s browser would refuse to open outdated interfaces because of HTTPS issues, ARTEMIS could continue checking with utilities such as curl with certificate verification disabled, and still get results.
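With curl, certificate verification is disabled with the `-k` (`--insecure`) option. A rough Python equivalent using only the standard library is sketched below; the helper names `insecure_context` and `fetch_status` are hypothetical. Skipping verification is acceptable only for authorized testing, since it removes protection against interception.

```python
import ssl
import urllib.request


def insecure_context():
    """TLS context that accepts any certificate, similar in spirit to `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be turned off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # do not validate the certificate chain
    return ctx


def fetch_status(url, timeout=5.0):
    """Return the HTTP status of `url` without verifying its certificate."""
    with urllib.request.urlopen(url, timeout=timeout, context=insecure_context()) as resp:
        return resp.status
```

This covers expired or self-signed certificates only. Very old devices that offer only obsolete TLS versions or ciphers may still fail the handshake and need further, host-specific settings.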

Another aspect of the discussion is economic. Across its long sessions, ARTEMIS ran for a total of 16 hours, and by the authors’ measurements one configuration cost about $18 per hour. For comparison, they cite a rate of $60 per hour for professional pentesters. The point of the comparison is simple: despite obvious weaknesses, autonomous agents already appear competitive in terms of cost-effectiveness, especially when used as a tool for continuous and systematic testing of large infrastructures.
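As a back-of-the-envelope illustration of those figures, the snippet below multiplies the quoted hourly rates by the 16-hour runtime. It assumes the $18 rate applies uniformly to the whole run and that a human would bill the same number of hours; the article does not give total costs, so the totals are derived, not reported.

```python
AGENT_RATE_USD = 18   # per hour, as quoted in the article
HUMAN_RATE_USD = 60   # per hour, as quoted in the article
RUN_HOURS = 16        # total ARTEMIS runtime, as quoted in the article


def total_cost(rate_usd, hours):
    """Cost of an engagement billed at a flat hourly rate."""
    return rate_usd * hours


agent_cost = total_cost(AGENT_RATE_USD, RUN_HOURS)   # 288
human_cost = total_cost(HUMAN_RATE_USD, RUN_HOURS)   # 960
ratio = HUMAN_RATE_USD / AGENT_RATE_USD              # roughly 3.3x

print(f"agent: ${agent_cost}, human: ${human_cost}, ratio: {ratio:.1f}x")
```

Under these assumptions the agent run comes to $288 against $960 for one human tester over the same hours.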

The authors believe the study’s main contribution lies not only in showing who is stronger, but in attempting to bring AI evaluation closer to reality: real-world networks are noisy and heterogeneous, and they demand work over a long time horizon, not the solving of simulated problems. They also acknowledge the experiment’s limitations, namely the limited time and sample size, and suggest moving to more reproducible environments and longer tests to better understand where autonomous agents truly improve security and where they remain dangerously overconfident.
