OWASP Benchmark, verified.

The full scorecard, methodology, and per-CWE breakdown behind our 0.91 Youden Index. No cherry-picking. No private corpus. Every number reproducible by anyone with the benchmark suite.

0.91

Youden Index

TPR − FPR, higher is better

94%

True Positive Rate

Real vulnerabilities caught

False Positive Rate

Benign code flagged as vulnerable
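For anyone who wants the arithmetic behind these three numbers: the Youden Index is the True Positive Rate minus the False Positive Rate, both computed from simple confusion counts. The sketch below is illustrative only. The function name `youden_index` and the counts passed to it are made up for the example and are not our benchmark results.

```python
def youden_index(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute TPR, FPR, and the Youden Index (TPR - FPR) from confusion counts.

    tp: real vulnerabilities that were flagged
    fn: real vulnerabilities that were missed
    fp: benign cases that were flagged
    tn: benign cases that were correctly left alone
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one real case and one benign case")
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return {"tpr": tpr, "fpr": fpr, "youden": tpr - fpr}


# Illustrative counts only (NOT benchmark results).
example = youden_index(tp=80, fn=20, fp=10, tn=90)
print(example)
```

A tool that flags everything scores TPR 1.0 and FPR 1.0, so its Youden Index is 0. That is why the metric rewards catching real issues without flagging benign code.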

How we tested

Four steps. Published. Reproducible. No optimization on the test set, no post-hoc filtering, no cherry-picked samples.

OWASP Benchmark Suite v1.2

We ran against the full public test suite: 2,740 test cases across 11 vulnerability categories, the same suite commercial SAST tools are commonly measured against. No subset, no filter, no private corpus. See: owasp.org/www-project-benchmark.

Blind Evaluation

The model has never seen the OWASP Benchmark data. No fine-tuning on test cases, no prompt leakage, no lookup. Every finding generated from the code alone, in the same conditions as a real customer PR review.

Agent Swarm Review

Each test case was reviewed by the full specialist-agent swarm — the same pipeline customers use in production. No shortcuts, no modified prompts. Average runtime per case: seconds. Average cost per case: well under a dollar.

Scoring + Adversarial Validation

Results scored by the official OWASP Benchmark scorecard. Every finding manually reviewed against the OWASP ground truth. Where the platform identified the correct CWE family but disputed the exact ID, we also computed a score with CWE family credit. The number reported above is the more conservative "strict" score, without that credit.
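To make the difference between "strict" and family-credit scoring concrete, here is a minimal sketch. It is not the official OWASP scorecard code and not our evaluation code. The `Case` type, the `CWE_FAMILY` grouping, the test names, and the findings are all hypothetical, chosen only to show how the two scores can diverge.

```python
from dataclasses import dataclass

# Hypothetical family grouping, for illustration only.
CWE_FAMILY = {327: "weak-crypto", 328: "weak-crypto"}


@dataclass(frozen=True)
class Case:
    name: str
    cwe: int       # CWE the test case targets
    is_real: bool  # ground truth: True if the vulnerability is real


def is_flagged(case: Case, reported: set, strict: bool) -> bool:
    """Strict needs the exact CWE; otherwise a same-family CWE also counts."""
    if case.cwe in reported:
        return True
    if strict:
        return False
    family = CWE_FAMILY.get(case.cwe)
    return family is not None and any(CWE_FAMILY.get(c) == family for c in reported)


def youden(cases, findings, strict=True) -> float:
    """TPR - FPR; assumes at least one real and one benign case."""
    tp = fn = fp = tn = 0
    for case in cases:
        flagged = is_flagged(case, findings.get(case.name, set()), strict)
        if case.is_real and flagged:
            tp += 1
        elif case.is_real:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn) - fp / (fp + tn)


# Made-up cases and findings (NOT benchmark data).
cases = [
    Case("T1", 327, True),   # tool reports CWE-328: right family, different ID
    Case("T2", 89, True),    # exact match
    Case("T3", 89, False),   # benign, not flagged
    Case("T4", 22, False),   # benign, flagged
]
findings = {"T1": {328}, "T2": {89}, "T4": {22}}

strict_score = youden(cases, findings, strict=True)
family_score = youden(cases, findings, strict=False)
print(strict_score, family_score)
```

In this toy data the only difference is case T1. Strict scoring counts it as a miss, while family credit counts it as a catch, so the strict score comes out lower.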

Per-CWE Results

Youden Index by vulnerability class. NecessityWorks vs. legacy SAST baseline.

CWE	Vulnerability Class	Legacy SAST	NecessityWorks
CWE-22	Path Traversal	0.19	0.93
CWE-78	OS Command Injection	0.31	0.96
CWE-79	Cross-Site Scripting	0.28	0.91
CWE-89	SQL Injection	0.34	0.97
CWE-90	LDAP Injection	0.21	0.88
CWE-327	Weak Cryptographic Algorithm	0.18	0.86
CWE-328	Weak Hash	0.22	0.89
CWE-330	Insecure Randomness	0.25	0.92
CWE-501	Trust Boundary Violation	0.15	0.84
CWE-614	Insecure Cookie	0.29	0.90
CWE-643	XPath Injection	0.20	0.88
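If you want to check the table yourself, the sketch below takes the per-CWE values exactly as listed above and computes a simple unweighted mean for each column. Treating every category equally is an assumption made for this example. A figure computed over pooled or differently weighted counts can differ slightly from an unweighted mean.

```python
# Per-CWE Youden Index values copied from the table above:
# (legacy SAST baseline, NecessityWorks)
per_cwe = {
    "CWE-22": (0.19, 0.93),
    "CWE-78": (0.31, 0.96),
    "CWE-79": (0.28, 0.91),
    "CWE-89": (0.34, 0.97),
    "CWE-90": (0.21, 0.88),
    "CWE-327": (0.18, 0.86),
    "CWE-328": (0.22, 0.89),
    "CWE-330": (0.25, 0.92),
    "CWE-501": (0.15, 0.84),
    "CWE-614": (0.29, 0.90),
    "CWE-643": (0.20, 0.88),
}

# Unweighted mean: every category counts equally (an assumption).
legacy_mean = sum(legacy for legacy, _ in per_cwe.values()) / len(per_cwe)
nw_mean = sum(nw for _, nw in per_cwe.values()) / len(per_cwe)

# Category with the lowest NecessityWorks score.
weakest = min(per_cwe, key=lambda cwe: per_cwe[cwe][1])

print(f"legacy mean: {legacy_mean:.2f}")      # 0.24
print(f"NecessityWorks mean: {nw_mean:.2f}")  # 0.90
print(f"lowest-scoring class: {weakest}")     # CWE-501
```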

Integrity guardrails

Four rules we committed to before we published a number.

No training on the test set

The OWASP Benchmark is public, so it's tempting for vendors to fine-tune on it. We don't. Every finding is generated by the same model and pipeline customers use, with no benchmark-specific tuning.

Same pipeline as production

No modified prompts, no stripped-down reviewer, no fast-path. The benchmark ran through the exact same agent swarm and call-graph pipeline your PRs would run through.

Manual adversarial review

Every finding and non-finding was reviewed against OWASP ground truth. Where we disputed ground truth (it happens), we reported the more conservative score.

Reproducible end-to-end

Methodology, prompts, evaluation code, and raw outputs available to every design partner and independent auditor. Run the same test yourself — we'll help you set it up.

Want the full scorecard?

HTML scorecard with the complete test case breakdown, agent reasoning logs, and methodology notes. Share it with your security team, your CTO, your board.

Download Scorecard

Ready to run it on your repo?

Early access opens soon. Bring your hardest repo. We'll bring the agents.