Stanford’s Test Proves the Point: Agentic AI Is Transforming Offensive Security, but Real Defense Still Requires a Hybrid of AI and Human Expertise

Nidhi Aggarwal
Chief Product Officer
Sandeep Singh
Director, Technical Services

New independent research gives the world its first real look at how agentic AI performs against professional penetration testers in a live enterprise environment. The takeaway is clear: AI can now scale offensive operations in ways that were unimaginable a year ago, but on its own it cannot deliver the offensive security needed to counter attackers who are using increasingly intelligent techniques to exploit weaknesses across AI, cloud, and modern IT environments.

The findings highlight both the remarkable strengths and the critical limitations of agentic testing, reinforcing the need for a hybrid approach that blends AI-driven scale with human judgment, creativity, and context.

For years, the industry has argued over whether AI would replace, augment, or fall short of human pentesters. Some vendors promised fully autonomous testing. Others dismissed AI as immature. What the debate lacked was real-world evidence.

That changed with “Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing,” a study conducted by Stanford, Carnegie Mellon, and Gray Swan AI and covered by The Wall Street Journal, which reported that AI agents are “coming dangerously close to beating humans.” The results are impressive, but they also reveal exactly why AI cannot stand alone.

At HackerOne, we reviewed the full paper. The findings validate the strategic vision we are building with our customers and partners. Offensive security is not becoming automated or humanless. It is evolving into a hybrid discipline that combines the scale and consistency of AI agents with the creativity, context, and expertise of human security professionals. This is the model that supports true continuous threat exposure management and keeps pace with the speed of today’s digital innovation.

Why Independent Testing and Benchmarking Matter for Agentic AI 

As AI adoption accelerates, enterprises are placing enormous trust in tools and techniques that remain relatively new and, in some cases, poorly understood. The stakes are high: AI systems behave in unfamiliar ways, cloud environments continue to expand, and attackers are adopting the same AI capabilities as defenders.

This is why independent benchmarking and evaluations are essential. External studies give security teams clarity on:

  • The quality and reliability of agentic security testing,
  • How AI agents compare to human pentesters,
  • How AI agents compare to one another, and
  • The tradeoffs enterprises must understand before deploying AI-augmented testing at scale.

We are on the brink of something new. To use agentic PTaaS effectively, we need to go in with our eyes wide open to its strengths, limitations, and blind spots.

Inside the Stanford Study Benchmarking Agentic AI Against Human Pentesters

The Stanford research team conducted a controlled evaluation of AI-driven penetration testing frameworks alongside certified human pentesters in a live university network with roughly 8,000 hosts across 12 subnets. Ten human testers participated alongside six AI-based agentic frameworks, including ARTEMIS, a new multi-agent system designed for offensive security.

The results were striking:

  • ARTEMIS (A2) placed 2nd overall, outperforming 9 out of 10 human participants.
  • The top human outperformed the best AI by 17%, demonstrating significantly greater technical sophistication.
  • ARTEMIS (A2) posted the lowest false-positive rate of any AI agent, at 18%, while human testers were nearly perfect in accuracy.

The numbers matter, but the patterns matter even more.

Where Agentic AI Excels in Penetration Testing: Coverage, Consistency, and Scale

The Stanford study surfaced a set of strengths that consistently differentiated agentic AI systems from human pentesters. These advantages do not replace human skill, but they do create new possibilities for continuous testing and broader coverage across complex digital environments.

Systematic Coverage at Enterprise Scale

Agentic AI approaches penetration testing with machine-level consistency. It does not get tired, distracted, or pulled in too many directions at once. During the study, ARTEMIS ran up to eight sub-agents in parallel, maintaining focus on multiple leads simultaneously.

Researchers observed cases where human testers identified a potential vulnerability but never circled back due to time constraints or competing priorities. The AI agents, by contrast, continued systematically until each path was fully explored.

This kind of predictable follow-through is essential for enterprises that need reliable and repeatable security coverage across fast-growing AI and cloud environments.
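
To make the orchestration pattern concrete, here is a minimal sketch, not ARTEMIS’s actual implementation, of a coordinator fanning leads out to a bounded pool of parallel sub-agents and driving every lead to completion. The names, the mapping of the eight-worker cap to a semaphore, and the use of Python’s asyncio are our own assumptions.

```python
import asyncio

# Hypothetical illustration of the fan-out pattern described in the study:
# a coordinator keeps a pool of at most eight concurrent sub-agents
# (matching ARTEMIS's reported cap) and never drops a lead.
MAX_SUBAGENTS = 8

async def investigate(lead: str) -> str:
    """Stand-in for a sub-agent fully exploring one lead."""
    await asyncio.sleep(0.1)  # placeholder for real enumeration work
    return f"{lead}: fully explored"

async def run_engagement(leads: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_SUBAGENTS)  # cap concurrent sub-agents

    async def worker(lead: str) -> str:
        async with sem:
            return await investigate(lead)

    # gather() drives every queued lead to completion -- the
    # "predictable follow-through" humans lost to time pressure.
    return await asyncio.gather(*(worker(l) for l in leads))

if __name__ == "__main__":
    results = asyncio.run(run_engagement([f"host-{i}" for i in range(20)]))
    print(results)
```

The design point is the final gather(): no lead leaves the queue until it has been fully explored, regardless of how many other paths open up along the way.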

Cost Efficiency That Enables Higher Testing Frequency

AI agents in the study cost between $18 and $60 per hour to operate, even when using advanced models. When paired with appropriate oversight, this cost profile makes continuous or higher-frequency offensive testing far more practical than traditional models that rely solely on human experts.

For resource-constrained teams or organizations with rapidly changing environments, this efficiency opens the door to more proactive and continuous exposure management.

Strength in Infrastructure-Level Findings

Agentic AI demonstrated strong performance in uncovering infrastructure-level vulnerabilities that require methodical enumeration. These included:

  • Default credential discovery,
  • Anonymous LDAP binds,
  • SMB misconfigurations, and
  • Outdated services with known CVEs.

These issues are common but can be overlooked when human testers focus their limited time on more complex, high-value exploitation paths. By handling this foundational layer of testing, AI agents help ensure that basic exposures do not slip through the cracks.
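
As a rough illustration of this foundational layer, the sketch below attempts an anonymous LDAP bind, the kind of methodical check an agent can repeat across thousands of hosts. The hostname, base DN, and choice of the ldap3 library are illustrative assumptions, not details from the study.

```python
from ldap3 import Server, Connection, ALL

# Attempt an anonymous bind: no username or password supplied.
# Host and base DN below are placeholders, not systems from the study.
server = Server("ldap.example.internal", get_info=ALL)
conn = Connection(server)  # no credentials -> anonymous bind

if conn.bind():
    print("Anonymous bind accepted")
    # Trivial read to gauge how much an unauthenticated client can see.
    conn.search("dc=example,dc=internal", "(objectClass=*)",
                attributes=["cn"], size_limit=5)
    for entry in conn.entries:
        print(entry)
else:
    print("Anonymous bind rejected:", conn.result)
```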

Advantages in CLI-Native Testing

Some vulnerabilities require command-line interaction rather than traditional browser-based testing. In one notable case, an older iDRAC server with outdated HTTPS ciphers could not be accessed through modern browsers.

Human testers abandoned the lead when the GUI stalled.

The AI agent, operating at the CLI level, continued the investigation and successfully exploited the vulnerability. This highlights how agentic systems can uncover issues humans might miss due to tooling limitations or workflow friction.
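
For a sense of what that CLI-level persistence can look like, here is a hedged sketch that probes a legacy HTTPS endpoint using cipher suites a modern browser would refuse. The host, port, and OpenSSL security-level override are illustrative assumptions, not the study’s actual tooling.

```python
import socket
import ssl

# Placeholder target; @SECLEVEL=0 re-enables cipher suites that modern
# OpenSSL builds (and browsers) disable by default.
HOST, PORT = "idrac.example.internal", 443

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE           # legacy appliance, self-signed cert
ctx.minimum_version = ssl.TLSVersion.TLSv1
ctx.set_ciphers("DEFAULT:@SECLEVEL=0")    # permit outdated cipher suites

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        print("Negotiated:", tls.version(), tls.cipher())
        tls.sendall(b"GET / HTTP/1.0\r\nHost: " + HOST.encode() + b"\r\n\r\n")
        print(tls.recv(256).decode(errors="replace"))
```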

Where Humans Remain Essential in Penetration Testing: Creativity, Context, and Quality

While the Stanford study surfaced clear advantages for agentic AI, it also highlighted several areas where human expertise delivers greater impact. These strengths are not incremental. They reflect capabilities that today’s models cannot replicate, and that will remain essential even as AI continues to accelerate.

Strength in GUI-Based Exploitation

Some vulnerabilities require navigating and interacting with web interfaces, workflows, and visual states that current agentic systems cannot interpret.

In the study, nearly 80% of human testers found a critical remote code execution vulnerability in the TinyPilot web interface. AI agents missed it completely.

For many enterprise applications, particularly those with rich user interfaces or layered authentication paths, GUI-based reasoning remains a distinctly human advantage.

Expertise in Business Logic and Application-Layer Flaws

Agentic AI demonstrated solid coverage of infrastructure issues, but the more sophisticated vulnerabilities came from human testers. These included:

  • SQL injection leading to credential extraction,
  • Stored cross-site scripting, and
  • Complex, multi-step exploitation chains.

These findings require an understanding of intent, workflow, and business logic that humans naturally bring and AI does not yet grasp. Application-layer testing remains one of the clearest areas where human intuition leads to higher-value results.

Natural Filtering of False Positives

AI agents in the study misinterpreted HTTP 200 responses and reported “successful authentication” even when the login attempt had actually failed.

Human testers, familiar with how applications behave, intuitively recognized these cases as invalid. This natural ability to understand flow, context, and the meaning of system responses helps avoid the noise that slows down engineering and remediation teams.
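
A short sketch shows how easy the trap is to fall into, and how cheap the human-style sanity check is. The URL, form fields, cookie name, and failure markers below are hypothetical.

```python
import requests

# An HTTP 200 is not a successful login: many applications return 200
# with an error page in the body. An agent keying on status alone files
# a false positive; checking body markers or a session artifact is the
# human-style sanity test.
resp = requests.post(
    "https://app.example.internal/login",
    data={"username": "admin", "password": "admin"},
    allow_redirects=False,
    timeout=10,
)

status_says_ok = resp.status_code == 200
body_says_failed = any(
    marker in resp.text.lower()
    for marker in ("invalid password", "login failed", "try again")
)
session_issued = "sessionid" in resp.cookies  # hypothetical cookie name

if status_says_ok and body_says_failed:
    print("200 OK but the login actually failed -- would-be false positive")
elif resp.status_code in (301, 302) or session_issued:
    print("Likely authenticated: redirect or session cookie observed")
```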

Creative Attack Chaining and Deep Reasoning

One of the clearest advantages surfaced in the study was creativity.

The top-performing human tester scored 63% higher than the best AI agent on technical complexity. Their success came from creative chaining, exploration, and multi-step reasoning that went beyond systematic enumeration.

These skills are particularly important when uncovering novel vulnerabilities or piecing together subtle signals across distributed systems because these tasks require imagination as much as technical depth.

Why Architecture Matters More Than the Model in Agentic Penetration Testing 

One of the more surprising insights from the Stanford study was how much performance varied depending on the underlying framework. The same model delivered dramatically different results depending on how it was orchestrated. 

GPT-5 within the ARTEMIS framework outperformed half of the human participants, yet the same GPT-5 embedded in other agentic systems outperformed only 20% of the testers. In another case, Claude Code refused the task entirely.

The takeaway is clear. In offensive security, architecture matters more than raw model capability. Purpose-built, security-first systems consistently outperform generic wrappers or agents not designed for the demands of real-world penetration testing.

Continuous Threat Exposure Management Requires a Hybrid Model of AI and Human Expertise

Agentic AI is powerful, and it is changing offensive security faster than many expected. But as the Stanford test shows, it will not secure modern enterprises on its own. The real advantage comes when AI’s scale is combined with human intuition and creativity to deliver Continuous Threat Exposure Management.

That hybrid model is not a nice-to-have; it's a must-have. It is the only approach that can keep pace with the new intelligence and techniques shaping today’s threat landscape. If we want to deploy these tools responsibly, we need more independent studies, more transparency, and more hard data.

HackerOne is expanding our hybrid offensive capabilities to meet this moment and to advance our mission of making the internet safer. To understand why hybrid security teams are emerging as the future of offensive security, explore The Rise of the Bionic Hacker. This is the direction the industry is moving, and those who act early will lead the next era of cyber defense.


About the Authors

Nidhi Aggarwal
Chief Product Officer

Nidhi is the Chief Product Officer at HackerOne, where she leads the execution of the company’s platform vision and strategy. She is a tech entrepreneur and business leader with over 15 years of experience driving growth and transformation at technology companies.