Redazione RHC : 6 September 2025 09:12
Artificial intelligence systems have been criticized for producing confusing vulnerability reports and flooding open-source maintainers with irrelevant submissions. But researchers at Nanjing University and the University of Sydney now offer a counterexample: an agent called A2 that can find and verify vulnerabilities in Android applications, mimicking the workflow of a human bug hunter. The new work builds on the team’s earlier A1 project, which was able to exploit bugs in smart contracts.
The authors claim that A2 achieved 78.3% coverage on the Ghera benchmark suite, outperforming the static analyzer APKHunt, which reached only 30%. Run against 169 real-world APKs, it found 104 zero-day vulnerabilities, 57 of which were confirmed by automatically generated working exploits. Among them was a medium-severity bug in an app with over 10 million installs: an Intent redirection issue that could allow a malicious app to gain control.
A2’s main distinguishing feature is the validation module, which was absent in its predecessor.
The old A1 system used a fixed verification scheme that only assessed whether an attack would be profitable. A2, by contrast, confirms a vulnerability step by step, breaking the process down into concrete tasks. As an example, the authors describe an application that stored its AES key in clear text. The agent first finds the key in the strings.xml resource file, then uses it to forge a password-reset token, and finally verifies that this token actually bypasses authentication. Every step carries an automatic check: from matching extracted values to confirming that the target Activity launches and that the expected content appears on screen.
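The stepwise validation described above can be sketched as a chain of tasks, each with its own automatic check. This is a minimal, self-contained illustration, not the paper's harness: all names are hypothetical, the strings.xml content is a toy example, and an HMAC construction stands in for the app's AES-based token scheme, whose exact format the article does not describe.

```python
import re
import hmac
import hashlib

# Toy stand-in for an app resource file with a hard-coded key (hypothetical).
STRINGS_XML = '<string name="aes_key">0123456789abcdef</string>'

def step1_find_key(xml_text: str) -> str:
    """Step 1: locate the hard-coded key, with an automatic check."""
    m = re.search(r'name="aes_key">([0-9a-f]+)<', xml_text)
    assert m is not None, "validation failed: no hard-coded key found"
    return m.group(1)

def step2_forge_token(key: str, user: str) -> str:
    """Step 2: forge a password-reset token using the leaked key."""
    return hmac.new(key.encode(), user.encode(), hashlib.sha256).hexdigest()

def server_accepts(token: str, user: str, key: str) -> bool:
    """Simulated server-side check the real app would perform."""
    expected = hmac.new(key.encode(), user.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(token, expected)

def step3_verify_bypass(token: str, user: str, key: str) -> bool:
    """Step 3: confirm the forged token actually passes authentication."""
    ok = server_accepts(token, user, key)
    assert ok, "validation failed: forged token rejected"
    return ok

key = step1_find_key(STRINGS_XML)
token = step2_forge_token(key, "victim@example.com")
confirmed = step3_verify_bypass(token, "victim@example.com", key)
print("vulnerability confirmed:", confirmed)  # prints: vulnerability confirmed: True
```

The point of the structure is that a failed assertion at any step aborts the chain, so only findings that survive every check are reported as confirmed.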
To function, A2 combines several commercial language models: OpenAI o3, Gemini 2.5 Pro, Gemini 2.5 Flash, and gpt-oss-120b. They are assigned distinct roles: a planner develops the attack strategy, an executor carries out the actions, and a validator confirms the result. According to the authors, this architecture replicates human methodology, which has allowed them to reduce noise and increase the share of confirmed findings. The developers point out that traditional analysis tools produce thousands of insignificant signals and very few real threats, whereas their agent can immediately demonstrate that a flaw is exploitable.
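The three-role split can be outlined as a small orchestration loop. This is a hedged sketch under our own assumptions: the model names come from the article, but the role-to-model routing and the stubbed responses are illustrative, not the paper's code.

```python
# Hypothetical role-to-model routing (the assignment itself is our assumption).
ROLE_TO_MODEL = {
    "planner": "openai-o3",
    "executor": "gemini-2.5-flash",
    "validator": "gemini-2.5-pro",
}

def call_model(model: str, role: str, payload):
    """Placeholder for a real LLM API call; returns canned outputs here."""
    if role == "planner":
        # The planner turns a target into an ordered attack plan.
        return ["extract key", "forge token", "test bypass"]
    if role == "executor":
        # The executor performs one step and records evidence.
        return {"step": payload, "evidence": f"ran '{payload}' ok"}
    if role == "validator":
        # The validator confirms the result only if every step succeeded.
        return all("ok" in r["evidence"] for r in payload)

def hunt(target_apk: str) -> bool:
    plan = call_model(ROLE_TO_MODEL["planner"], "planner", target_apk)
    results = [call_model(ROLE_TO_MODEL["executor"], "executor", s) for s in plan]
    return call_model(ROLE_TO_MODEL["validator"], "validator", results)

print(hunt("demo.apk"))  # prints: True
```

Separating the validator from the executor is what filters noise: a claim only becomes a finding once a different role, backed by a different model, signs off on the evidence.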
The researchers also calculated the system’s running costs. Vulnerability detection costs between $0.0004 and $0.03 per app depending on the model, while a full detect-and-verify cycle averages $1.77. Using only Gemini 2.5 Pro, the cost rises to $8.94 per bug. For comparison, last year a team from the University of Illinois demonstrated that GPT-4 could build an exploit from a vulnerability description for $8.80. In other words, finding and confirming a flaw in a mobile app costs a small fraction of what a medium-severity vulnerability fetches in bug bounty programs, where rewards run to hundreds or thousands of dollars.
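A back-of-the-envelope comparison makes the economics concrete. The per-bug costs below are the article's figures; the $500 bounty is a hypothetical mid-range medium-severity payout, not a number from the article.

```python
# Per-bug costs reported in the article (USD).
full_cycle_avg = 1.77        # average detect-and-verify cycle
gemini_pro_only = 8.94       # worst case, using only Gemini 2.5 Pro

# Hypothetical medium-severity bounty payout (our assumption).
hypothetical_bounty = 500.0

ratio_avg = hypothetical_bounty / full_cycle_avg
ratio_worst = hypothetical_bounty / gemini_pro_only
print(f"payout / avg cost:   {ratio_avg:.0f}x")
print(f"payout / worst cost: {ratio_worst:.0f}x")
```

Even in the worst case, the assumed payout exceeds the discovery cost by more than an order of magnitude, which is the asymmetry the article's concluding paragraphs worry about.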
Experts point out that A2 already outperforms static program analyzers for Android, and that A1 is close to the best results for smart contracts. They are confident this approach can speed up and simplify the work of researchers and attackers alike, since instead of developing complex tools they can simply call the APIs of pre-trained models. A problem remains, however: bounty hunters can use A2 to cash in quickly, but bounty programs do not cover every bug, leaving loopholes for attackers to exploit the discovered flaws directly.
The authors of the article believe that the field is only beginning to develop and that a surge in both defensive and offensive activity can be expected in the near future. Industry representatives emphasize that systems like A2 shift vulnerability searches from endless alerts to confirmed results, reducing the number of false positives and allowing teams to focus on real risks.
For now, the source code is only available to researchers with official partnerships, to maintain a balance between open science and responsible dissemination.