GPT-5.5 vs Claude Opus 4.7 vs Sonnet 4.6: OpenAI's Frontier Cyber Model Faces Anthropic's Best on Vulnerability Validation
Last week, we shared our benchmarks on Opus 4.7 and how it sharpened precision on vulnerability validation. Then OpenAI released GPT-5.5, and we had to know: how does the latest OpenAI frontier model compare against Anthropic’s models?
We evaluated GPT-5.5 through OpenAI's Trusted Access for Cyber program, which provides early access to frontier models for cybersecurity research. That access lets us put GPT-5.5 on the same validation harness we use for every frontier model we evaluate, alongside Claude Opus 4.7 and Claude Sonnet 4.6, and push all three through the reality of vulnerability triage at scale.
Across our platform, we're seeing two trends: time-to-discovery is shrinking as AI-assisted discovery tools accelerate, and the volume of vulnerability findings is growing. Validation that can't keep pace, in both speed and accuracy, becomes the bottleneck between a finding landing and a team remediating it. Faster validation means findings move from "pending triage" to "confirmed, fix now" in minutes rather than hours. Higher accuracy means fewer false positives wasting engineering cycles and fewer missed vulnerabilities sitting unpatched. We therefore wanted to determine which frontier model delivers the best balance of both, and whether model choice is still a meaningful lever for improving validation outcomes.
We put three models on the same benchmark: Claude Opus 4.7, Claude Sonnet 4.6, and OpenAI's GPT-5.5. We evaluated each model using our internal validation harness, built from real-world vulnerability workflows and informed by patterns we see across the HackerOne platform. We tested on both structured CVE cases and unfiltered vulnerability reports, the kind that ranges from meticulously documented findings to overstated claims and AI-generated noise. We ran them first with a baseline agent, then invested in prompt engineering and additional tooling to see how far we could push performance.
The results surprised us.
The Baseline: All Three Models Are Neck-and-Neck on Precision
We evaluated each model on a set of publicly known CVEs across C/C++ projects, with each CVE contributing two validation cases: the vulnerable code and its corresponding patch. The agent workflow is identical for all three: trace the code, assess exploitability, and render a structured verdict.
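For concreteness, here is a minimal sketch of the kind of structured verdict the harness scores, with one case per CVE variant. The field names and the scoring helper are illustrative, not our production schema.

```python
from dataclasses import dataclass
from typing import Literal

Verdict = Literal["vulnerable", "not_vulnerable"]

@dataclass
class ValidationResult:
    case_id: str        # e.g. "CVE-NNNN-NNNNN/vulnerable" or "CVE-NNNN-NNNNN/patched"
    expected: Verdict   # ground truth: the unpatched variant is vulnerable, the patched one is not
    predicted: Verdict  # the agent's verdict after tracing the code and assessing exploitability
    reasoning: str      # the code path and exploitability argument the agent rendered

def accuracy(results: list[ValidationResult]) -> float:
    """Fraction of benchmark cases where the agent's verdict matches ground truth."""
    return sum(r.expected == r.predicted for r in results) / len(results)
```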
With the same generic prompt and minimal tool set, all three models performed remarkably well. On raw correctness, a single verdict out of 38 test cases separated the best-performing model from the worst, a difference of roughly 2.6%.
GPT-5.5 leans conservative, with fewer false positives and faster completions. When it flags a vulnerability, it's almost certainly real. It completed validations nearly 3x faster than Sonnet and 50% faster than Opus.
Sonnet 4.6 leans thorough. It catches vulnerabilities that both GPT-5.5 and Opus 4.7 miss, particularly in complex buffer overflow and memory corruption patterns. The tradeoff is that it occasionally flags patched code as still vulnerable.
Opus 4.7 sits in the middle, with balanced reasoning and strong coherence on multi-step analysis.
Think of each model as a security analyst with a different disposition: one more cautious, another more aggressive. Depending on your risk tolerance, whether you would rather miss fewer vulnerabilities or generate fewer false alarms, one might suit your workflow better than another. But the performance gap is narrow enough that model selection alone will not transform your validation outcomes.
Where Models Agree and Where They Don't
We analyzed the error patterns across all three models. The finding: 75% of errors are shared by two or more models.
Only a single CVE fooled all three. Its vulnerable code contained what appeared to be input validation, but the check was insufficient: all three models saw the bounds check, concluded the code was safe, and missed that the validation logic itself was flawed.
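The pattern is worth making concrete. Below is a hypothetical Python analog of the shape, not the actual CVE: the start of a read is validated, which is exactly the check the models anchored on, while the end of the read never is.

```python
def read_field(buf: bytes, offset: int, length: int) -> bytes:
    # There is a real bounds check here, and it is what all three models saw.
    if not 0 <= offset < len(buf):
        raise ValueError("offset out of range")
    # The check is insufficient: offset + length is never validated. In Python
    # the slice truncates silently, so callers get short data with no error;
    # in C, the equivalent pattern reads past the end of the buffer.
    return buf[offset : offset + length]
```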
The dominant shared failure mode is highly specific: patch completeness judgment. All three models reliably detect vulnerabilities in vulnerable code. Where they struggle is in determining whether a patch fully resolves the issue, particularly when mitigations are partial or when the fix introduces minor behavioral changes that don't eliminate the root cause.
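A companion toy example (again ours, not one of the benchmark CVEs) shows what a partial mitigation looks like: the patch adds a real check, and the diff reads like a fix, but the root cause survives.

```python
def sanitize_v1(path: str) -> str:
    return path  # pre-patch: no sanitization at all

def sanitize_v2(path: str) -> str:
    # The "patch": strip traversal sequences. Diffed against v1 this looks
    # like a complete fix, which is exactly the judgment call models get wrong.
    return path.replace("../", "")

# The mitigation is partial: str.replace makes a single left-to-right pass,
# so a crafted input reassembles the traversal sequence after stripping.
assert sanitize_v2("....//etc/passwd") == "../etc/passwd"
```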
We also tested majority-vote ensembling: run all three models on the same case and pick the verdict that two or more agree on. It barely helped, because the models share blind spots on the same ambiguous patches. When they are wrong, they tend to be wrong together, so voting reproduces the shared errors instead of correcting them.
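The scheme itself is the textbook one, sketched below with the two-label verdict from earlier. Majority voting only corrects errors that are independent; with correlated blind spots, the lone correct verdict gets outvoted.

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Return the verdict that two or more of the three models agree on."""
    label, _ = Counter(verdicts).most_common(1)[0]
    return label

# On the ambiguous patches the models miss together, so even when one
# model gets it right, the ensemble sides with the shared error:
assert majority_vote(["not_vulnerable", "not_vulnerable", "vulnerable"]) == "not_vulnerable"
```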
That's a notable finding on its own, but CVEs are only half the story. A clean CVE dataset doesn't capture what validation actually looks like in production, where reports arrive in every shape, quality, and degree of honesty.
Beyond CVEs: The Same Pattern on Live Application Reports
To validate that these findings extend beyond our CVE benchmark, we ran the same models against a separate test: vulnerability reports submitted against a real web application, spanning XSS, SQLi, SSRF, RCE, and IDOR findings.
These reports reflect the full spectrum of what lands in a triage queue. Some are meticulously documented with CVSS scores, step-by-step reproduction, and accurate impact analysis. Others are thinly written, vague on details, or clearly generated with AI assistance. And some exhibit a pattern familiar to anyone who has triaged bug bounty submissions: overstated impact, where a real but low-severity finding is dressed up with a theoretical exploit chain to justify a higher payout. One report chains a working SSRF into a claimed Remote Code Execution through a deserialization path that is provably impossible on the target framework. Another reports SQL injection against an endpoint that uses parameterized queries, with fabricated PoC screenshots.
This is the reality of vulnerability triage at scale. The challenge isn't just answering "is this code vulnerable?"; it's separating signal from noise across a stream of reports that vary wildly in quality, honesty, and accuracy of impact claims. It's a problem we see across the HackerOne platform every day, and one that generic benchmarks built from clean CVE descriptions don't capture.
All three models agreed on the large majority of reports, correctly validating XSS, SQLi, and SSRF findings regardless of report quality. On the disagreements, Opus demonstrated noticeably sharper judgment, particularly on the reports designed to mislead. Where both Sonnet and GPT-5.5 hedged with qualified verdicts, Opus committed: either confirming a vulnerability outright or rejecting it entirely.
GPT-5.5 was the fastest model per report, completing validations in roughly half the time of Sonnet. But speed came with a consistency trade-off. On one confirmed SSRF vulnerability, GPT-5.5 returned "Fabricated" on its first run, then correctly identified it as "Valid" on rerun, with identical inputs. Both Claude models returned consistent verdicts across every run. For triage workflows where you don't get second chances, this non-determinism matters.
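Non-determinism like this is straightforward to quantify. A minimal sketch, where the `agent.run` interface is hypothetical rather than our harness API: rerun the same case with identical inputs and measure how often the verdicts agree.

```python
from collections import Counter

def verdict_consistency(agent, case, runs: int = 5) -> float:
    """Rerun one case with identical inputs; 1.0 means fully deterministic verdicts."""
    verdicts = [agent.run(case).verdict for _ in range(runs)]
    _, modal_count = Counter(verdicts).most_common(1)[0]
    return modal_count / runs
```

By this measure, the Claude models were fully consistent in our runs; GPT-5.5 was not.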
We also tested GPT-5.5 with elevated reasoning effort, which improved its accuracy on complex reports to match Opus, but with significantly higher latency. The model needed 5x more tool calls than Opus to reach the same conclusions. Opus identified critical code paths in an average of 16 tool calls per report; GPT-5.5 with high reasoning effort required 85.
On the inflated RCE claim, both Sonnet and GPT-5.5 identified that the underlying SSRF worked but the full chain was broken, and gave partial credit. Opus identified three independent reasons the chain fails and rejected the report outright. The exploit chain is provably impossible against this codebase, but partial credit for sub-components muddles the signal that matters to a triaging team.
On the fabricated SQL injection, Sonnet correctly identified the endpoint was parameterized but stopped there. Opus cross-checked the PoC's response bodies against what the controller actually returns, found inconsistencies in the field selection, and concluded the evidence was manufactured, not merely mistaken. GPT-5.5 reached the same "not a vulnerability" conclusion as Opus here, but through less rigorous reasoning: it identified the parameterized query without investigating whether the PoC evidence was fabricated.
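That cross-check is mechanical enough to sketch. Everything below is hypothetical (the field set, the names); the point is that a PoC "response" claiming data the endpoint never selects is evidence of fabrication, not of an honest mistake.

```python
# Hypothetical: the columns the controller's parameterized query actually
# selects, established by reading the code rather than trusting the report.
CONTROLLER_FIELDS = {"id", "username", "created_at"}

def evidence_consistent(poc_response: dict) -> bool:
    """False if the PoC claims to have extracted fields the endpoint never returns."""
    return set(poc_response) <= CONTROLLER_FIELDS

# The report's "leaked" row includes a column the query never selects,
# so the screenshot was manufactured rather than mistaken.
claimed_row = {"id": 1, "username": "admin", "password_hash": "5f4dcc3b"}
assert not evidence_consistent(claimed_row)
```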
On speed, GPT-5.5 had the lowest time per report, but Opus was the most efficient in terms of reasoning steps. When the answer is "no," Opus finds it in fewer tool calls and moves on. GPT-5.5 often searched extensively but arrived at the same conclusion Opus reached in a fraction of the steps, or worse, arrived at the wrong one.
All three models correctly validated the critical findings (UNION-based SQL injection, stored XSS, SSRF with file read) regardless of report quality. A hastily written three-line XSS report was triaged just as accurately as a detailed write-up with reproduction steps and remediation guidance.
The pattern across both benchmarks is consistent:
GPT-5.5 is fast and capable on straightforward findings, but less reliable on reports that require multi-step reasoning or that are designed to mislead.
Opus demonstrates the highest "signal efficiency": it reads fewer files, makes fewer tool calls, and still arrives at more decisive, correct verdicts.
When we gave GPT-5.5 more reasoning budget, it closed the accuracy gap, but at significant latency cost, suggesting the limitation is in reasoning quality per step rather than exploration breadth.
Opus is the better choice when false positives are expensive and you need decisive, actionable verdicts. But the gap remains narrow enough that the consistent finding holds: all three models are already operating at a level where the remaining errors are judgment calls about exploit chain completeness and impact accuracy, the same category of error that better tooling addresses.
What this benchmark reinforces is that the hardest part of validation at scale isn't the technical analysis; it's the judgment layer: distinguishing overstated impact from genuine severity, AI-generated noise from legitimate findings, and partial truths from fabrications. That judgment is shaped by seeing thousands of real submissions across hundreds of programs. It's the kind of signal that lives in our platform data, not in any model's pretraining set, and it's what allows us to build validation tooling that handles the messy reality of what actually gets reported.
The Power of Tooling Today
Here's what excites us most: with a completely generic agent, no model-specific tuning, no specialized tools, and minimal prompting, every frontier model already performs at 80%+ accuracy.
The remaining errors are not fundamental reasoning limitations; they fall into two categories: determining whether a patch fully resolves a vulnerability, and separating genuine findings from noise in reports with fabricated evidence or overstated impact. Those are the kinds of judgment that better tooling can address.
When we invested in the scaffolding around the model (refining the prompt to emphasize diff-aware reasoning, adding tools for targeted code navigation, and structuring the validation workflow to explicitly compare pre- and post-patch behavior), we pushed accuracy to 98% across our benchmark set.
This held regardless of which model powered the agent.
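The shape of that workflow is worth sketching. This is a simplified illustration rather than our production harness, and the `agent.run` interface is hypothetical; the key move is forcing the agent to commit to a concrete triggering path on the vulnerable code, then re-trace that same path through the patched code before judging completeness.

```python
def validate_patch(agent, pre_patch_case, post_patch_case) -> str:
    # 1. Confirm the vulnerability and pin down the exact triggering path.
    pre = agent.run(
        pre_patch_case,
        instruction="Trace the code and identify the concrete input path that triggers the bug.",
    )
    if pre.verdict != "vulnerable":
        return pre.verdict  # nothing to re-check against the patch

    # 2. Re-trace that specific path through the patched code. Asking "does
    #    the fix eliminate the root cause along this path, or only a symptom?"
    #    targets exactly the patch-completeness errors described above.
    post = agent.run(
        post_patch_case,
        instruction=f"Re-trace this path after the patch and judge completeness: {pre.path}",
    )
    return post.verdict
```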
But For How Long?
At HackerOne, we closely track each new model release and how it affects the accuracy and effectiveness of the agents we are building; validation and remediation are two of the biggest levers for resolving newly discovered vulnerabilities faster than before.
Today, the differentiator is tooling. The performance gap between models is narrow enough that a different set of vulnerabilities or a different codebase could easily reshuffle the ranking. What remained consistent across all our experiments is that the gap between a generic agent and an optimized one far exceeds the gap between models.
We are building the advanced tooling necessary to get the best out of frontier models. The most interesting question in agentic security isn't which model to pick. It's how quickly the scaffolding we build today becomes unnecessary. Until then, the advantage goes to teams that can operationalize AI effectively, and that means not just better prompts and tools, but better signal.
The judgment that separates a fabricated report from a genuine critical finding is learned from what actually gets reported and confirmed across thousands of programs, not from pretraining. That’s the data moat that turns a capable model into a reliable validation system, and what allows us to close the gap between discovery and remediation as the stream of vulnerabilities continues to grow.