The Vulnerability Detection Gap: Why LLMs Can Spot Bugs but Can't Find Them

Trent Security just dropped a benchmark study that should make every AppSec team rethink their tooling stack—or at least understand why their current setup keeps missing stuff. The company took five security tools, pointed them at 28 real production vulnerabilities pulled from CWE-Bench (actual CVEs with actual fixes in actual open-source repos), and measured two distinct things: whether a tool could identify the right vulnerability class somewhere in the codebase, and whether it could actually locate the specific file that was patched. The results reveal something important about where security tooling is broken.

What They Tested

The benchmark split tools into three architectural categories. Semgrep and CodeQL represented traditional rule-based static analysis—these are scanners designed for broad coverage using predefined patterns. Claude Code (Opus 4.7) and OpenAI Codex (GPT-5.3) were the general-purpose AI coding agents, given straightforward prompts to perform full repository security audits with no benchmark-specific tuning. Trent's own Security Assessment Agent represented a third approach: threat-model-guided agentic analysis that combines structured exploration with vulnerability-specific reasoning. All tools ran three times against each of the 28 CVEs across four CWE classes: XSS, path traversal, code injection, and OS command injection.

The Two Metrics That Matter

Here's where it gets interesting. Category Detection Rate measured whether a tool reported the right vulnerability class somewhere in the repository—think "I found an XSS issue." Detection Rate was stricter: it required both the correct CWE category AND pointing at the file that was actually patched by the upstream fix. On Category Detection, Claude Code hit 65% and Trent came in at 64.3%, with Semgrep at roughly 43%. These numbers look decent until you see the other column.
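To make the difference between the two metrics concrete, here is a minimal sketch of how they could be scored from a tool's findings. The `Finding` and `GroundTruth` types, the field names, and the example paths are all assumptions made for illustration; the study's actual scoring harness is not described here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One issue reported by a tool (hypothetical shape)."""
    cwe: str   # e.g. "CWE-79"
    file: str  # repository-relative path the tool pointed at


@dataclass(frozen=True)
class GroundTruth:
    """What the upstream fix tells us about one CVE (hypothetical shape)."""
    cwe: str
    patched_files: frozenset[str]


def category_hit(findings: list[Finding], truth: GroundTruth) -> bool:
    # Looser metric: the right CWE class was reported anywhere in the repo.
    return any(f.cwe == truth.cwe for f in findings)


def detection_hit(findings: list[Finding], truth: GroundTruth) -> bool:
    # Stricter metric: right CWE class AND a file touched by the upstream fix.
    return any(f.cwe == truth.cwe and f.file in truth.patched_files for f in findings)


def rates(runs: list[tuple[list[Finding], GroundTruth]]) -> tuple[float, float]:
    """Return (category detection rate, detection rate) over all runs."""
    n = len(runs)
    cat = sum(category_hit(f, t) for f, t in runs)
    det = sum(detection_hit(f, t) for f, t in runs)
    return cat / n, det / n


# Two made-up runs against the same made-up CVE.
truth = GroundTruth("CWE-79", frozenset({"src/render/cleaner.py"}))
run_a = [Finding("CWE-79", "src/web/views.py")]       # right class, wrong file
run_b = [Finding("CWE-79", "src/render/cleaner.py")]  # right class, right file
print(rates([(run_a, truth), (run_b, truth)]))
```

Under this scoring, every Detection Rate hit is also a Category Detection hit, so the second number can never exceed the first.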

The Localization Problem

When it came to actually pinpointing the vulnerable file (the thing that would let a developer verify and fix the issue), Trent hit 25%. Claude Code managed just 8.7%. Semgrep and CodeQL both landed around 3.6%, and Codex limped in at 1.8%. The gap between "I see something XSS-shaped in this repo" and "the XSS is in DefaultHTMLCleaner.getDefaultConfiguration()" is where repository-scale security actually lives, and most tools are still solving the wrong half of the problem.

Why Current Tools Fail Differently

The article breaks down exactly why each approach misses vulnerabilities at scale. Pattern-based scanners like Semgrep provide excellent coverage, since they walk every file, but lack the context to determine whether a flagged flow is actually exploitable in that specific application. A path traversal warning means nothing if you can't tell whether the framework already normalizes paths or whether an attacker can reach the endpoint. General-purpose reasoning agents have the opposite problem: they can reason about exploitability but don't systematically explore repositories, and even when they inspect the right area of code they can classify the vulnerability incorrectly. Claude Code examined the vulnerable xwiki file in CVE-2023-29201 but labeled it as XML External Entity instead of XSS.
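The path traversal point is easier to see in code. The sketch below is a generic containment check, not anything taken from the benchmark; the function name `is_within_base` and the example paths are hypothetical. A pattern scanner may flag a file open on user input whether or not a guard like this sits upstream, which is exactly the context it lacks.

```python
import os


def is_within_base(base_dir: str, user_path: str) -> bool:
    """Return True if user_path, resolved against base_dir, stays inside it."""
    base = os.path.realpath(base_dir)
    # realpath collapses ".." segments and resolves symlinks before comparing.
    target = os.path.realpath(os.path.join(base, user_path))
    try:
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised e.g. for paths on different drives on Windows: treat as outside.
        return False


# Hypothetical upload directory used only for illustration.
print(is_within_base("/srv/app/uploads", "report.txt"))
print(is_within_base("/srv/app/uploads", "../../etc/passwd"))
```

If every route to the flagged file access passes through a check like this, the warning is likely a false positive; if some route skips it, the warning may be real. Telling those cases apart takes reasoning about the surrounding application, not just matching the pattern.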

The Real Insight

The core argument is that repository-scale security assessment is fundamentally a search problem before it's a reasoning problem. You can't apply sophisticated exploitability analysis to code you never looked at, and you don't want to flood developers with generic warnings about every potentially dangerous pattern in the codebase. Trent's approach tries to solve both halves by first building a threat model of the repository structure, then using vulnerability-specific security knowledge to guide exploration toward likely vulnerable flows before applying deeper reasoning.
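The "search before reasoning" idea can be sketched as a two-stage pipeline: a cheap pass that narrows the repository to likely candidates, then an expensive check applied only to those. This is a toy illustration under heavy assumptions and is not Trent's implementation: the keyword table `SINK_HINTS`, the `rank_files` heuristic, and the `deep_check` callback (a stand-in for whatever deeper reasoning step is used) are all hypothetical.

```python
from typing import Callable

# Hypothetical per-class sink hints; a real system would use a threat model,
# not a keyword list.
SINK_HINTS: dict[str, tuple[str, ...]] = {
    "xss": ("innerHTML", "dangerouslySetInnerHTML"),
    "command_injection": ("os.system", "subprocess"),
}


def rank_files(files: dict[str, str], vuln_class: str) -> list[str]:
    """Stage 1 (search): order files by how many class-specific hints they contain."""
    hints = SINK_HINTS[vuln_class]
    scored = [(sum(src.count(h) for h in hints), path) for path, src in files.items()]
    # Highest score first, ties broken by path; files with no hints are dropped.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [path for score, path in ranked if score > 0]


def assess(
    files: dict[str, str],
    vuln_class: str,
    deep_check: Callable[[str, str], bool],
    budget: int = 5,
) -> list[str]:
    """Stage 2 (reasoning): run the expensive check on the top candidates only."""
    candidates = rank_files(files, vuln_class)[:budget]
    return [path for path in candidates if deep_check(path, files[path])]


# Made-up repository contents and a trivial stand-in for the deep check.
repo = {
    "a.py": "os.system(cmd)",
    "b.py": "print('hi')",
    "c.py": "subprocess.run(['ls'])",
}
flagged = assess(repo, "command_injection", lambda path, src: "cmd" in src)
print(flagged)
```

The design point is the split itself: the search stage decides what gets looked at, and the reasoning stage decides what gets reported, so neither unexplored code nor unfiltered pattern matches reach the developer.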

What This Means For AppSec

The benchmark makes one thing clear: if your security tooling is giving you high detection numbers but developers are still shipping vulnerabilities, you're probably seeing the Category Detection Rate effect—your tools recognize the right vulnerability classes somewhere in your repos without actually finding the issues that need fixing. The 8.7% localization rate for Claude Code versus its 65% category rate illustrates exactly how much of that "detection" is noise.
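One way to read the two columns together is to ask what share of a tool's category-level detections also landed on a patched file. The quick calculation below uses only the rates quoted in this article (Semgrep's category rate is given as "roughly 43%", so its result is approximate). It assumes both rates were computed over the same set of runs, and relies on the fact that a Detection Rate hit is by definition also a Category Detection hit.

```python
# (category detection rate %, detection rate %) as quoted in this article.
reported = {
    "Trent": (64.3, 25.0),
    "Claude Code": (65.0, 8.7),
    "Semgrep": (43.0, 3.6),  # category rate is approximate
}


def localized_share(category_rate: float, detection_rate: float) -> float:
    """Fraction of category-level detections that also hit a patched file."""
    return detection_rate / category_rate


for tool, (cat, det) in reported.items():
    print(f"{tool}: {localized_share(cat, det):.1%}")
# Trent: 38.9%
# Claude Code: 13.4%
# Semgrep: 8.4%
```

Read this way, a large majority of every tool's category-level "detections" did not point at the code that was actually fixed.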

Key Takeaways

Category Detection Rate and Detection Rate measure fundamentally different things—high scores on one don't guarantee useful output
Static analyzers like Semgrep provide coverage but lack the context to determine exploitability in specific applications
General-purpose coding agents understand vulnerability patterns but fail at systematic repository exploration
The roughly 14x gap between Trent's 25% and Codex's 1.8% Detection Rate suggests architectural choices matter enormously at scale

The Bottom Line

This benchmark isn't about declaring a winner. It's about understanding that the security tooling industry has been measuring the wrong thing for years. More detections don't mean more secure code when those detections point developers to the wrong files, and they definitely don't help when AI coding assistants can recognize vulnerability classes but can't find them in your actual codebase. The tools that combine systematic exploration with contextual reasoning are going to be the ones that actually move the needle on production security.
