Methodology

Overview

This page explains how Vibe Check public detection records are produced and what the current corpus does and does not prove.

What the record is

A detection record is a public accounting of measured Vibe Check results on a defined validation corpus. It reports scanner output on that corpus only. It does not publish comparative claims or competitor framing.

What this corpus is

  • intentionally vulnerable benchmark applications
  • public training targets built to contain known vulnerability classes
  • synthetic fixtures we control for scanner verification

This corpus is useful for cross-framework calibration, persistence-quality checks, exploitability-retention review, and pipeline discipline.

It is not a substitute for production-code validation on ordinary maintained repositories.

What counts as a classified finding

  • A classified finding is a unique persisted finding after duplicate-row cleanup and taxonomy normalization.
  • The current record normalizes 275 artifact rows into 269 unique classified findings.
  • Severity, category, file path, reasoning, and remediation text come from persisted scan records and later validation overlays.
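
A minimal sketch of the duplicate-row cleanup step, for illustration only; the row fields and the composite key are assumptions, not the actual persisted artifact schema.

```python
# Minimal sketch of duplicate-row cleanup. The field names and the composite
# key are illustrative assumptions, not the actual artifact schema.
def dedupe_findings(rows: list[dict]) -> list[dict]:
    seen = set()
    unique = []
    for row in rows:
        key = (row["scan_id"], row["file_path"], row["category"], row["severity"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# In the current record this step reduces 275 artifact rows to 269 unique
# classified findings.
```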

What detection rate means

  • Detection rate on the public page is a manifest-capture metric.
  • Formula: manifest capture = credited seeded cases / total seeded cases
  • Persisted finding count and manifest capture are related but not interchangeable.
  • Findings outside the seeded manifest are still published, but they do not increase manifest capture.
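
A minimal sketch of the manifest-capture calculation defined above; the function and argument names are illustrative.

```python
def manifest_capture(credited_seeded: int, total_seeded: int) -> float:
    """manifest capture = credited seeded cases / total seeded cases"""
    if total_seeded == 0:
        raise ValueError("manifest capture is undefined without seeded cases")
    return credited_seeded / total_seeded

# Non-manifest findings are still published, but they never enter the
# numerator or the denominator of this ratio.
```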

What `framework template applied` means

  • `framework template applied` identifies the framework-aware scan profile selected for that target.
  • It is a routing fact about scan configuration, not proof that every framework-specific rule fired.
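
A minimal sketch of what that routing fact amounts to, assuming a hypothetical framework-to-profile table; the framework names, template names, and lookup are illustrative, not the scanner's actual configuration.

```python
# Hypothetical framework-to-profile routing table; all names are illustrative.
FRAMEWORK_PROFILES = {
    "spring": "java-spring-template",
    "django": "python-django-template",
    "express": "node-express-template",
}

def select_scan_profile(detected_framework: str) -> str:
    # "framework template applied" records which profile was routed to,
    # not whether every framework-specific rule in that profile fired.
    return FRAMEWORK_PROFILES.get(detected_framework, "generic-template")
```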

Pass 1: broad classification

Pass 1 classifies persisted findings against broad CVE, CWE, and OWASP reference sets. It is intentionally permissive and catches likely pattern matches quickly.

Possible outputs:

  • PATTERN_MATCH
  • LIKELY_FALSE_POSITIVE
  • UNMATCHED
  • NOVEL_FINDING
  • TRUE_POSITIVE
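
To make the label set concrete, here is a minimal sketch; the matching rule is a stand-in for illustration, not the actual Pass 1 logic.

```python
from enum import Enum

class Pass1Label(Enum):
    PATTERN_MATCH = "PATTERN_MATCH"
    LIKELY_FALSE_POSITIVE = "LIKELY_FALSE_POSITIVE"
    UNMATCHED = "UNMATCHED"
    NOVEL_FINDING = "NOVEL_FINDING"
    TRUE_POSITIVE = "TRUE_POSITIVE"

def classify_pass1(finding_cwe: str, reference_cwes: set[str]) -> Pass1Label:
    # Intentionally permissive stand-in: a CWE present in the broad reference
    # set is credited as a pattern match; everything else stays unmatched for
    # the narrower later passes to resolve.
    if finding_cwe in reference_cwes:
        return Pass1Label.PATTERN_MATCH
    return Pass1Label.UNMATCHED
```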

Pass 2: deeper validation

Pass 2 re-checks findings with fuller file context and benchmark awareness. It is narrower than Pass 1 and is used to separate intentionally vulnerable training material from weaker broad labels.

G1: independent audit

G1 used a different, adversarial methodology to challenge both earlier passes and surface any novelty or false-positive blind spots. It sampled findings across severity levels and framework families rather than simply repeating earlier broad classification logic.

I1: taxonomy normalization

I1 normalized the full corpus into provenance-aware labels:

  • TRAINING_APP_INTENTIONAL
  • TEST_FIXTURE

That normalization step is what turns this corpus from a broad vulnerability list into an honest account of what kind of material was actually scanned.

Auto-resolution and escalation closure

The remaining escalation queue was then resolved by deterministic provenance-plus-context review so that code-level classification decisions did not require founder review.

The final normalized counts are:

  • TRAINING_APP_INTENTIONAL: 205
  • TEST_FIXTURE: 64
  • PATTERN_MATCH: 0
  • LIKELY_FALSE_POSITIVE: 0
  • UNMATCHED: 0
  • NOVEL_FINDING: 0
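
As a consistency check, the normalized label counts should account for every unique classified finding in the record:

```python
normalized_counts = {
    "TRAINING_APP_INTENTIONAL": 205,
    "TEST_FIXTURE": 64,
    "PATTERN_MATCH": 0,
    "LIKELY_FALSE_POSITIVE": 0,
    "UNMATCHED": 0,
    "NOVEL_FINDING": 0,
}

# 205 + 64 = 269, matching the unique classified findings reported above.
assert sum(normalized_counts.values()) == 269
```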

Why `0 NOVEL_FINDING` is the expected honest outcome

This corpus is built from intentionally vulnerable training apps, public benchmarks, and synthetic fixtures we control. Those targets are designed to surface known or deliberately planted vulnerability classes.

A result of 0 NOVEL_FINDING on this corpus is therefore the expected honest outcome. It is not evidence that the validation chain failed.

What this corpus proves

  • the scanner can persist findings across a broad framework set on known-vulnerable targets
  • HIGH and CRITICAL findings can survive exploitability verification on this corpus
  • the review and normalization pipeline can converge cleanly without leaving unresolved code-level decisions

What this corpus does not prove

  • real-world scanner performance on ordinary production repositories
  • scanner-wide false-positive behavior on maintained production software
  • novel undisclosed vulnerability discovery in third-party codebases

Those questions require production-code validation, which is tracked separately.

Failed scans

  • Failed scans remain visible in the public table for transparency.
  • Failed scans are excluded from aggregate manifest-capture rollups.
  • Their capture cell should read n/a (scan failed), not 0.0%.
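
A minimal sketch of both rules, assuming a hypothetical per-target record shape; only the exclusion rule and the n/a display come from the record itself.

```python
def aggregate_manifest_capture(targets: list[dict]) -> float:
    # Failed scans stay visible in the table but are excluded from the rollup.
    completed = [t for t in targets if t["status"] != "failed"]
    credited = sum(t["credited_seeded"] for t in completed)
    total = sum(t["total_seeded"] for t in completed)
    return credited / total if total else 0.0

def capture_cell(target: dict) -> str:
    # A failed scan renders as "n/a (scan failed)", never as "0.0%".
    if target["status"] == "failed":
        return "n/a (scan failed)"
    return f"{100 * target['credited_seeded'] / target['total_seeded']:.1f}%"
```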

Safe-control validation

  • A safe-control target is included to test whether the scanner raises findings where no seeded vulnerability is expected.
  • That control result is published as safe-control validation, not as a scanner-wide false-positive rate.
  • This corpus is not the right surface for a broad false-positive claim because nearly every target is intentionally vulnerable by design.

Taxonomy normalization

Taxonomy normalization distinguishes between findings on deliberately vulnerable training material and findings on ordinary software. That distinction prevents the public record from overstating what the corpus demonstrates.

In this record, normalization means:

  • intentionally vulnerable benchmark or training-app findings become TRAINING_APP_INTENTIONAL
  • controlled local seeded cases become TEST_FIXTURE
  • only production-code discoveries would remain eligible for TRUE_POSITIVE or NOVEL_FINDING
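
A minimal sketch of those normalization rules; the provenance values and the helper name are illustrative, not the pipeline's actual field names.

```python
def normalize_label(provenance: str, validated_label: str) -> str:
    # Provenance decides the published taxonomy label; only findings on
    # production code stay eligible for TRUE_POSITIVE or NOVEL_FINDING.
    if provenance in {"benchmark_app", "training_app"}:
        return "TRAINING_APP_INTENTIONAL"
    if provenance == "seeded_fixture":
        return "TEST_FIXTURE"
    return validated_label
```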

Java Spring

The Java Spring rows currently shown in the public record come from pre-fix production scan IDs captured before the later Java fetch-window correction. They remain historically accurate, but they are not the final word on post-fix Java coverage.

Until Q1 completes, Java Spring stays qualified.

Production-code validation

Production-code validation is a separate evidence stream from the training corpus. Firefox, and later other maintained production repositories, are the right place to evaluate:

  • practical false positives
  • operational usefulness
  • potential novelty
  • behavior on ordinary software rather than known-vulnerable labs

Worked example: persisted rows vs unique classified findings

  • If the underlying artifact contains duplicate publication rows, the normalized record can legitimately show fewer unique classified findings than raw persisted rows.
  • In this record:
      • raw artifact rows: 275
      • duplicate rows removed: 6
      • unique classified findings: 269

Worked example: manifest capture vs findings count

  • If a target has 4 seeded vulnerabilities and Vibe Check catches all 4, manifest capture is 100%.
  • If a target has 4 seeded vulnerabilities, Vibe Check catches 2, misses 2, and also finds 3 additional non-manifest issues, manifest capture is still 50%.
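
The same arithmetic written out, with illustrative variable names:

```python
seeded_total = 4
seeded_caught = 2
non_manifest_findings = 3  # published, but never credited toward capture

capture = seeded_caught / seeded_total
assert capture == 0.5  # 50%, regardless of the 3 extra non-manifest findings
```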

Worked example: training corpus vs production code

  • A finding on OWASP WebGoat can be real and severe while still being classified TRAINING_APP_INTENTIONAL, because the repository is intentionally vulnerable by design.
  • A comparable finding on a maintained production repository would require different treatment and would remain eligible for production-oriented labels.

Worked example: Java Spring qualification

  • A production row can be historically correct while still being qualified if later investigation proves the scan was captured before a known fetch or scoring correction.
  • In that case the record should say both things plainly:
      • the original row remains the public measured result for that scan ID
      • later reruns or fixes may justify a future replacement row

Reproducibility and audit trail

  • Corpus manifest: `tests/comparison_corpus/repositories.json`
  • Source scan IDs are preserved in the public record
  • Validation overlays are preserved privately for audit, normalization, and founder review

Anyone can rerun the corpus and verify:

  1. the same targets and refs
  2. persisted findings and severity mix
  3. exploitability-retention outcomes
  4. normalized taxonomy counts
  5. the distinction between corpus validation and production-code validation
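
A minimal sketch of the first step, assuming the corpus manifest is a JSON array of target entries; the field names used here ("name", "ref") are an assumption, since the manifest schema is not documented on this page.

```python
import json
from pathlib import Path

# Assumption: repositories.json is a JSON array of target entries, each with
# something like a "name" and a pinned "ref". Adjust to the real schema.
manifest = json.loads(Path("tests/comparison_corpus/repositories.json").read_text())
for target in manifest:
    print(target.get("name"), target.get("ref"))
```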