Methodology

Overview

This page explains how Vibe Check public detection records are produced and what the current corpus does and does not prove.

What the record is

A detection record is a public accounting of measured Vibe Check results on a defined validation corpus. It reports scanner output on that corpus only. It does not publish comparative claims or competitor framing.

What this corpus is

  • intentionally vulnerable benchmark applications
  • public training targets built to contain known vulnerability classes
  • synthetic fixtures we control for scanner verification

This corpus is useful for cross-framework calibration, persistence-quality checks, exploitability-retention review, and pipeline discipline.

It is not a substitute for production-code validation on ordinary maintained repositories.

What counts as a classified finding

  • A classified finding is a unique persisted finding after duplicate-row cleanup and taxonomy normalization.
  • The current record normalizes 275 artifact rows into 269 unique classified findings.
  • Severity, category, file path, reasoning, and remediation text come from persisted scan records and later validation overlays.
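
A minimal sketch of the duplicate-row cleanup step, for illustration only; the row fields and the composite key are assumptions, not the actual persisted artifact schema.

```python
# Minimal sketch of duplicate-row cleanup. The field names and the composite
# key are illustrative assumptions, not the actual artifact schema.
def dedupe_findings(rows: list[dict]) -> list[dict]:
    seen = set()
    unique = []
    for row in rows:
        key = (row["scan_id"], row["file_path"], row["category"], row["severity"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# In the current record this step reduces 275 artifact rows to 269 unique
# classified findings.
```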

What detection rate means

  • Detection rate on the public page is a manifest-capture metric.
  • Formula: manifest capture = credited seeded cases / total seeded cases
  • Persisted finding count and manifest capture are related but not interchangeable.
  • Findings outside the seeded manifest are still published, but they do not increase manifest capture.
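
A minimal sketch of the manifest-capture calculation defined above; the function and argument names are illustrative.

```python
def manifest_capture(credited_seeded: int, total_seeded: int) -> float:
    """manifest capture = credited seeded cases / total seeded cases"""
    if total_seeded == 0:
        raise ValueError("manifest capture is undefined without seeded cases")
    return credited_seeded / total_seeded

# Non-manifest findings are still published, but they never enter the
# numerator or the denominator of this ratio.
```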

What `framework template applied` means

  • `framework template applied` identifies the framework-aware scan profile selected for that target.
  • It is a routing fact about scan configuration, not proof that every framework-specific rule fired.
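
A minimal sketch of what that routing fact amounts to, assuming a hypothetical framework-to-profile table; the framework names, template names, and lookup are illustrative, not the scanner's actual configuration.

```python
# Hypothetical framework-to-profile routing table; all names are illustrative.
FRAMEWORK_PROFILES = {
    "spring": "java-spring-template",
    "django": "python-django-template",
    "express": "node-express-template",
}

def select_scan_profile(detected_framework: str) -> str:
    # "framework template applied" records which profile was routed to,
    # not whether every framework-specific rule in that profile fired.
    return FRAMEWORK_PROFILES.get(detected_framework, "generic-template")
```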

Pass 1: broad classification

Pass 1 classifies persisted findings against broad CVE, CWE, and OWASP reference sets. It is intentionally permissive and catches likely pattern matches quickly.

Possible outputs:

  • PATTERN_MATCH
  • LIKELY_FALSE_POSITIVE
  • UNMATCHED
  • NOVEL_FINDING
  • TRUE_POSITIVE
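
To make the label set concrete, here is a minimal sketch; the matching rule is a stand-in for illustration, not the actual Pass 1 logic.

```python
from enum import Enum

class Pass1Label(Enum):
    PATTERN_MATCH = "PATTERN_MATCH"
    LIKELY_FALSE_POSITIVE = "LIKELY_FALSE_POSITIVE"
    UNMATCHED = "UNMATCHED"
    NOVEL_FINDING = "NOVEL_FINDING"
    TRUE_POSITIVE = "TRUE_POSITIVE"

def classify_pass1(finding_cwe: str, reference_cwes: set[str]) -> Pass1Label:
    # Intentionally permissive stand-in: a CWE present in the broad reference
    # set is credited as a pattern match; everything else stays unmatched for
    # the narrower later passes to resolve.
    if finding_cwe in reference_cwes:
        return Pass1Label.PATTERN_MATCH
    return Pass1Label.UNMATCHED
```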

Pass 2: deeper validation

Pass 2 re-checks findings with fuller file context and benchmark awareness. It is narrower than Pass 1 and is used to separate intentionally vulnerable training material from weaker broad labels.

G1: independent audit

G1 used a different, adversarial methodology to challenge both earlier passes and surface any novelty or false-positive blind spots. It sampled findings across severity levels and framework families rather than simply repeating earlier broad classification logic.

I1: taxonomy normalization

I1 normalized the full corpus into provenance-aware labels:

  • TRAINING_APP_INTENTIONAL
  • TEST_FIXTURE

That normalization step is what turns this corpus from a broad vulnerability list into an honest account of what kind of material was actually scanned.

Auto-resolution and escalation closure

The remaining escalation queue was then resolved by deterministic provenance-plus-context review so that code-level classification decisions did not require founder review.

The final normalized counts are:

  • TRAINING_APP_INTENTIONAL: 205
  • TEST_FIXTURE: 64
  • PATTERN_MATCH: 0
  • LIKELY_FALSE_POSITIVE: 0
  • UNMATCHED: 0
  • NOVEL_FINDING: 0
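
As a consistency check, the normalized label counts should account for every unique classified finding in the record:

```python
normalized_counts = {
    "TRAINING_APP_INTENTIONAL": 205,
    "TEST_FIXTURE": 64,
    "PATTERN_MATCH": 0,
    "LIKELY_FALSE_POSITIVE": 0,
    "UNMATCHED": 0,
    "NOVEL_FINDING": 0,
}

# 205 + 64 = 269, matching the unique classified findings reported above.
assert sum(normalized_counts.values()) == 269
```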

Why `0 NOVEL_FINDING` is the expected honest outcome

This corpus is built from intentionally vulnerable training apps, public benchmarks, and synthetic fixtures we control. Those targets are designed to surface known or deliberately planted vulnerability classes.

A result of 0 NOVEL_FINDING on this corpus is therefore the expected honest outcome. It is not evidence that the validation chain failed.

What this corpus proves

  • the scanner can persist findings across a broad framework set on known-vulnerable targets
  • HIGH and CRITICAL findings can survive exploitability verification on this corpus
  • the review and normalization pipeline can converge cleanly without leaving unresolved code-level decisions

What this corpus does not prove

  • real-world scanner performance on ordinary production repositories
  • scanner-wide false-positive behavior on maintained production software
  • novel undisclosed vulnerability discovery in third-party codebases

Those questions require production-code validation, which is tracked separately.

Failed scans

  • Failed scans remain visible in the public table for transparency.
  • Failed scans are excluded from aggregate manifest-capture rollups.
  • Their capture cell should read n/a (scan failed), not 0.0%.
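
A minimal sketch of both rules, assuming a hypothetical per-target record shape; only the exclusion rule and the n/a display come from the record itself.

```python
def aggregate_manifest_capture(targets: list[dict]) -> float:
    # Failed scans stay visible in the table but are excluded from the rollup.
    completed = [t for t in targets if t["status"] != "failed"]
    credited = sum(t["credited_seeded"] for t in completed)
    total = sum(t["total_seeded"] for t in completed)
    return credited / total if total else 0.0

def capture_cell(target: dict) -> str:
    # A failed scan renders as "n/a (scan failed)", never as "0.0%".
    if target["status"] == "failed":
        return "n/a (scan failed)"
    return f"{100 * target['credited_seeded'] / target['total_seeded']:.1f}%"
```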

Safe-control validation

  • A safe-control target is included to test whether the scanner raises findings where no seeded vulnerability is expected.
  • That control result is published as safe-control validation, not as a scanner-wide false-positive rate.
  • This corpus is not the right surface for a broad false-positive claim because nearly every target is intentionally vulnerable by design.

Taxonomy normalization

Taxonomy normalization distinguishes between findings on deliberately vulnerable training material and findings on ordinary software. That distinction prevents the public record from overstating what the corpus demonstrates.

In this record, normalization means:

  • intentionally vulnerable benchmark or training-app findings become TRAINING_APP_INTENTIONAL
  • controlled local seeded cases become TEST_FIXTURE
  • only production-code discoveries would remain eligible for TRUE_POSITIVE or NOVEL_FINDING
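
A minimal sketch of those normalization rules; the provenance values and the helper name are illustrative, not the pipeline's actual field names.

```python
def normalize_label(provenance: str, validated_label: str) -> str:
    # Provenance decides the published taxonomy label; only findings on
    # production code stay eligible for TRUE_POSITIVE or NOVEL_FINDING.
    if provenance in {"benchmark_app", "training_app"}:
        return "TRAINING_APP_INTENTIONAL"
    if provenance == "seeded_fixture":
        return "TEST_FIXTURE"
    return validated_label
```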

Java Spring

The Java Spring rows currently shown in the public record come from pre-fix production scan IDs captured before the later Java fetch-window correction. They remain historically accurate, but they are not the final word on post-fix Java coverage.

Until Q1 completes, Java Spring stays qualified.

Production-code validation

Production-code validation is a separate evidence stream from the training corpus. Firefox, and later other maintained production repositories, are the right place to evaluate:

  • practical false positives
  • operational usefulness
  • potential novelty
  • behavior on ordinary software rather than known-vulnerable labs

Worked example: persisted rows vs unique classified findings

  • If the underlying artifact contains duplicate publication rows, the normalized record can legitimately show fewer unique classified findings than raw persisted rows.
  • In this record:
      • raw artifact rows: 275
      • duplicate rows removed: 6
      • unique classified findings: 269

Worked example: manifest capture vs findings count

  • If a target has 4 seeded vulnerabilities and Vibe Check catches all 4, manifest capture is 100%.
  • If a target has 4 seeded vulnerabilities, Vibe Check catches 2, misses 2, and also finds 3 additional non-manifest issues, manifest capture is still 50%.
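
The same arithmetic written out, with illustrative variable names:

```python
seeded_total = 4
seeded_caught = 2
non_manifest_findings = 3  # published, but never credited toward capture

capture = seeded_caught / seeded_total
assert capture == 0.5  # 50%, regardless of the 3 extra non-manifest findings
```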

Worked example: training corpus vs production code

  • A finding on OWASP WebGoat can be real and severe while still being classified TRAINING_APP_INTENTIONAL, because the repository is intentionally vulnerable by design.
  • A comparable finding on a maintained production repository would require different treatment and would remain eligible for production-oriented labels.

Worked example: Java Spring qualification

  • A production row can be historically correct while still being qualified if later investigation proves the scan was captured before a known fetch or scoring correction.
  • In that case the record should say both things plainly:
      • the original row remains the public measured result for that scan ID
      • later reruns or fixes may justify a future replacement row

Reproducibility and audit trail

  • Corpus manifest: `tests/comparison_corpus/repositories.json`
  • Source scan IDs are preserved in the public record
  • Validation overlays are preserved privately for audit, normalization, and founder review

Anyone can rerun the corpus and verify:

  1. the same targets and refs
  2. persisted findings and severity mix
  3. exploitability-retention outcomes
  4. normalized taxonomy counts
  5. the distinction between corpus validation and production-code validation
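
A minimal sketch of the first step, assuming the corpus manifest is a JSON array of target entries; the field names used here ("name", "ref") are an assumption, since the manifest schema is not documented on this page.

```python
import json
from pathlib import Path

# Assumption: repositories.json is a JSON array of target entries, each with
# something like a "name" and a pinned "ref". Adjust to the real schema.
manifest = json.loads(Path("tests/comparison_corpus/repositories.json").read_text())
for target in manifest:
    print(target.get("name"), target.get("ref"))
```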