
NecessityWorks · AI-Native SAST / Benchmark

Technical Brief · May 2026

OWASP Benchmark, verified.

The full scorecard, methodology, and per-CWE breakdown behind our 0.91 Youden Index. No cherry-picking. No private corpus. Every number reproducible by anyone with the benchmark suite.

Youden Index: 0.91 (TPR − FPR, higher is better)
True Positive Rate: 94% (real vulnerabilities caught)
False Positive Rate: 3% (benign code flagged as vulnerable)
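
For readers who want to check the arithmetic, here is a minimal Python sketch of how a Youden Index follows from a confusion matrix. The class and function names are illustrative, and the counts are hypothetical values chosen only to reproduce the 94% and 3% rates above; they are not the actual benchmark tallies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for one tool over a labeled test suite."""
    true_positives: int   # real vulnerabilities the tool flagged
    false_negatives: int  # real vulnerabilities the tool missed
    false_positives: int  # benign cases the tool flagged
    true_negatives: int   # benign cases the tool left alone


def true_positive_rate(c: ConfusionCounts) -> float:
    vulnerable = c.true_positives + c.false_negatives
    if vulnerable == 0:
        raise ValueError("suite contains no vulnerable test cases")
    return c.true_positives / vulnerable


def false_positive_rate(c: ConfusionCounts) -> float:
    benign = c.false_positives + c.true_negatives
    if benign == 0:
        raise ValueError("suite contains no benign test cases")
    return c.false_positives / benign


def youden_index(c: ConfusionCounts) -> float:
    """Youden Index = TPR - FPR; ranges from -1 to 1, 0 is no better than chance."""
    return true_positive_rate(c) - false_positive_rate(c)


# Hypothetical counts (NOT the real benchmark tallies), picked to give 94% / 3%.
example = ConfusionCounts(
    true_positives=94, false_negatives=6,
    false_positives=3, true_negatives=97,
)

if __name__ == "__main__":
    print(f"TPR={true_positive_rate(example):.2f} "
          f"FPR={false_positive_rate(example):.2f} "
          f"Youden={youden_index(example):.2f}")
```

A tool that flags everything scores TPR = 1 and FPR = 1, so its Youden Index is 0. That is why the metric rewards catching real issues without flooding reviewers with false alarms.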

How we tested

Four steps. Published. Reproducible. No optimization on the test set, no post-hoc filtering, no cherry-picked samples.

01

OWASP Benchmark Suite v1.2

We ran the full public test suite: 2,740 test cases across 11 vulnerability categories, the same suite commercial SAST tools are commonly measured against. No subset, no filter, no private corpus. See: owasp.org/www-project-benchmark.

02

Blind Evaluation

The model has never seen the OWASP Benchmark data. No fine-tuning on test cases, no prompt leakage, no lookup. Every finding generated from the code alone, in the same conditions as a real customer PR review.

03

Agent Swarm Review

Each test case was reviewed by the full specialist-agent swarm — the same pipeline customers use in production. No shortcuts, no modified prompts. Average runtime per case: seconds. Average cost per case: well under a dollar.

04

Scoring + Adversarial Validation

Results were scored with the official OWASP Benchmark scorecard, and every finding was manually reviewed against the OWASP ground truth. Where the platform identified the correct CWE family but disputed the exact ID, we also computed a score with family credit; the numbers reported above are the more conservative "strict" score.
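
The difference between strict scoring and family credit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scorecard's actual logic: the `CWE_FAMILIES` grouping and the `counts_as_hit` helper are hypothetical, and the brief does not say which CWE IDs it treats as belonging to the same family.

```python
from typing import Optional

# Hypothetical family grouping, for illustration only. The brief does not
# publish the families it uses.
CWE_FAMILIES = {
    "injection": {78, 89, 90, 643},
    "weak-crypto": {327, 328},
}


def family_of(cwe_id: int) -> Optional[str]:
    """Return the (hypothetical) family name for a CWE ID, or None."""
    for name, members in CWE_FAMILIES.items():
        if cwe_id in members:
            return name
    return None


def counts_as_hit(expected_cwe: int, reported_cwe: int, strict: bool = True) -> bool:
    """Decide whether a reported CWE earns credit for a vulnerable test case.

    strict=True  : only an exact CWE ID match earns credit.
    strict=False : a match within the same family also earns credit.
    """
    if expected_cwe == reported_cwe:
        return True
    if strict:
        return False
    expected_family = family_of(expected_cwe)
    return expected_family is not None and expected_family == family_of(reported_cwe)
```

Every strict hit is also a family-credit hit, so under this scheme the strict true-positive count can never exceed the family-credit count. That is the sense in which the strict score is the more conservative of the two.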

Per-CWE Results

Youden Index by vulnerability class. NecessityWorks vs. legacy SAST baseline.

CWE      Vulnerability Class           Legacy SAST   NecessityWorks
CWE-22   Path Traversal                0.19          0.93
CWE-78   OS Command Injection          0.31          0.96
CWE-79   Cross-Site Scripting          0.28          0.91
CWE-89   SQL Injection                 0.34          0.97
CWE-90   LDAP Injection                0.21          0.88
CWE-327  Weak Cryptographic Algorithm  0.18          0.86
CWE-328  Weak Hash                     0.22          0.89
CWE-330  Insecure Randomness           0.25          0.92
CWE-501  Trust Boundary Violation      0.15          0.84
CWE-614  Insecure Cookie               0.29          0.90
CWE-643  XPath Injection               0.20          0.88
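
The per-class figures above can be summarized with a short Python sketch. The numbers are copied from the table; the variable names are illustrative. The brief does not state how the headline Youden Index is aggregated across classes, so the unweighted mean computed here is only one plausible summary and need not equal the headline figure.

```python
from statistics import mean

# (legacy SAST, NecessityWorks) Youden Index per CWE, copied from the table.
SCORES = {
    "CWE-22": (0.19, 0.93),
    "CWE-78": (0.31, 0.96),
    "CWE-79": (0.28, 0.91),
    "CWE-89": (0.34, 0.97),
    "CWE-90": (0.21, 0.88),
    "CWE-327": (0.18, 0.86),
    "CWE-328": (0.22, 0.89),
    "CWE-330": (0.25, 0.92),
    "CWE-501": (0.15, 0.84),
    "CWE-614": (0.29, 0.90),
    "CWE-643": (0.20, 0.88),
}

# Unweighted means: each class counts equally, regardless of how many
# test cases it contains (an assumption, not the brief's stated method).
legacy_mean = mean(legacy for legacy, _ in SCORES.values())
necessityworks_mean = mean(nw for _, nw in SCORES.values())

# Classes where NecessityWorks scores lowest and highest.
weakest = min(SCORES, key=lambda cwe: SCORES[cwe][1])
strongest = max(SCORES, key=lambda cwe: SCORES[cwe][1])

if __name__ == "__main__":
    print(f"legacy mean: {legacy_mean:.3f}")
    print(f"NecessityWorks mean: {necessityworks_mean:.3f}")
    print(f"weakest class: {weakest}, strongest class: {strongest}")
```

A mean weighted by the number of test cases per class would need the per-class case counts, which this brief does not list.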

Integrity guardrails

Four rules we committed to before we published a number.

No training on the test set

The OWASP Benchmark is public, so it's tempting for vendors to fine-tune on it. We don't. Every finding is generated by the same model and pipeline customers use, with no benchmark-specific tuning.

Same pipeline as production

No modified prompts, no stripped-down reviewer, no fast-path. The benchmark ran through the exact same agent swarm and call-graph pipeline your PRs would run through.

Manual adversarial review

Every finding and non-finding was reviewed against OWASP ground truth. Where we disputed ground truth (it happens), we reported the more conservative score.

Reproducible end-to-end

Methodology, prompts, evaluation code, and raw outputs available to every design partner and independent auditor. Run the same test yourself — we'll help you set it up.

Want the full scorecard?

HTML scorecard with the complete test case breakdown, agent reasoning logs, and methodology notes. Share it with your security team, your CTO, your board.


Ready to run it on your repo?

Early access opens soon. Bring your hardest repo. We'll bring the agents.
