NecessityWorks · AI-Native SAST / Benchmark
Technical Brief · May 2026
OWASP Benchmark, verified.
The full scorecard, methodology, and per-CWE breakdown behind our 0.91 Youden Index. No cherry-picking. No private corpus. Every number reproducible by anyone with the benchmark suite.
How we tested
Four steps. Published. Reproducible. No optimization on the test set, no post-hoc filtering, no cherry-picked samples.
OWASP Benchmark Suite v1.2
Ran against the full public test suite of 2,740 test cases across 11 vulnerability categories — the same suite every commercial SAST tool is measured against. No subset, no filter, no private corpus. See: owasp.org/www-project-benchmark.
Blind Evaluation
The model has never seen the OWASP Benchmark data. No fine-tuning on test cases, no prompt leakage, no lookup. Every finding generated from the code alone, in the same conditions as a real customer PR review.
Agent Swarm Review
Each test case was reviewed by the full specialist-agent swarm — the same pipeline customers use in production. No shortcuts, no modified prompts. Average runtime per case: seconds. Average cost per case: well under a dollar.
Scoring + Adversarial Validation
Results were scored with the official OWASP Benchmark scorecard, and every finding was manually reviewed against the OWASP ground truth. Where the platform identified the correct CWE family but disputed the exact ID, we also computed a score with family credit applied. The headline figure above is the more conservative "strict" score, which grants no family credit.
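To make the strict and family-credit scores concrete, here is a minimal Python sketch. The Youden Index is the true positive rate minus the false positive rate. Everything else is an assumption made for illustration: the `Case` record, the `CWE_FAMILY` grouping, and the toy cases are hypothetical and are not the published evaluation code or data.

```python
from dataclasses import dataclass

# Hypothetical family grouping, for illustration only. The brief does not
# publish the mapping it used.
CWE_FAMILY = {
    78: "injection", 89: "injection", 90: "injection", 643: "injection",
    327: "weak-crypto", 328: "weak-crypto",
}

@dataclass(frozen=True)
class Case:
    expected_cwe: int         # CWE the test case targets
    vulnerable: bool          # ground truth: is the flaw real?
    reported_cwes: frozenset  # CWE IDs the tool reported for this case

def flagged(case, family_credit):
    """True if the tool's report counts as a hit on this case's target CWE."""
    if case.expected_cwe in case.reported_cwes:
        return True
    if not family_credit:
        return False
    family = CWE_FAMILY.get(case.expected_cwe)
    return family is not None and any(
        CWE_FAMILY.get(cwe) == family for cwe in case.reported_cwes
    )

def youden(cases, family_credit=False):
    """Youden Index = TPR - FPR over the given cases."""
    tp = fn = fp = tn = 0
    for case in cases:
        hit = flagged(case, family_credit)
        if case.vulnerable:
            tp, fn = tp + hit, fn + (not hit)
        else:
            fp, tn = fp + hit, tn + (not hit)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr - fpr

# Toy data (made up): four vulnerable and four safe SQL injection cases.
CASES = [
    Case(89, True, frozenset({89})),
    Case(89, True, frozenset({89})),
    Case(89, True, frozenset({90})),   # right family, wrong exact ID
    Case(89, True, frozenset()),       # missed entirely
    Case(89, False, frozenset()),
    Case(89, False, frozenset()),
    Case(89, False, frozenset()),
    Case(89, False, frozenset({89})),  # false alarm
]

strict = youden(CASES)
with_family = youden(CASES, family_credit=True)
reported = min(strict, with_family)  # report the more conservative score
print(strict, with_family, reported)
```

On this toy data the strict score is lower because the near-miss case counts as a false negative. Taking the minimum is one way to express "report the more conservative score".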
Per-CWE Results
Youden Index (true positive rate minus false positive rate) by vulnerability class. NecessityWorks vs. legacy SAST baseline.
| CWE | Vulnerability Class | Legacy SAST | NecessityWorks |
|---|---|---|---|
| CWE-22 | Path Traversal | | |
| CWE-78 | OS Command Injection | | |
| CWE-79 | Cross-Site Scripting | | |
| CWE-89 | SQL Injection | | |
| CWE-90 | LDAP Injection | | |
| CWE-327 | Weak Cryptographic Algorithm | | |
| CWE-328 | Weak Hash | | |
| CWE-330 | Insecure Randomness | | |
| CWE-501 | Trust Boundary Violation | | |
| CWE-614 | Insecure Cookie | | |
| CWE-643 | XPath Injection | | |
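A per-class breakdown like the table above can be derived by grouping test cases by CWE and computing true positive rate minus false positive rate within each group. The sketch below assumes each result is a `(category, vulnerable, flagged)` tuple. That row shape, the function name, and the toy rows are hypothetical and do not represent actual benchmark results.

```python
from collections import defaultdict

def per_category_youden(rows):
    """rows: iterable of (category, vulnerable, flagged) tuples.

    Returns {category: TPR - FPR}.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for category, vulnerable, flagged in rows:
        if vulnerable:
            counts[category]["tp" if flagged else "fn"] += 1
        else:
            counts[category]["fp" if flagged else "tn"] += 1

    scores = {}
    for category, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        tpr = c["tp"] / positives if positives else 0.0
        fpr = c["fp"] / negatives if negatives else 0.0
        scores[category] = tpr - fpr
    return scores

# Toy rows (made up), two categories with four cases each.
ROWS = [
    ("CWE-22", True, True), ("CWE-22", True, True),
    ("CWE-22", False, False), ("CWE-22", False, False),
    ("CWE-89", True, True), ("CWE-89", True, True),
    ("CWE-89", False, True), ("CWE-89", False, False),
]

for category, score in sorted(per_category_youden(ROWS).items()):
    print(f"{category}: {score:.2f}")
```

Scoring each class separately keeps a strong result in one category from masking a weak one in another.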
Integrity guardrails
Four rules we committed to before we published a number.
No training on the test set
The OWASP Benchmark is public, so it's tempting for vendors to fine-tune on it. We don't. Every finding is generated by the same model and pipeline customers use, with no benchmark-specific tuning.
Same pipeline as production
No modified prompts, no stripped-down reviewer, no fast-path. The benchmark ran through the exact same agent swarm and call-graph pipeline your PRs would run through.
Manual adversarial review
Every finding and non-finding was reviewed against OWASP ground truth. Where we disputed ground truth (it happens), we reported the more conservative score.
Reproducible end-to-end
Methodology, prompts, evaluation code, and raw outputs available to every design partner and independent auditor. Run the same test yourself — we'll help you set it up.
Want the full scorecard?
HTML scorecard with the complete test case breakdown, agent reasoning logs, and methodology notes. Share it with your security team, your CTO, your board.
Ready to run it on your repo?
Early access opens soon. Bring your hardest repo. We'll bring the agents.