Metacognitive Benchmark Suite · v3

Metacognitive Control for Frontier Models.

We measure what accuracy hides: confidence calibration, error monitoring, and belief updating under pressure. A new evaluation axis for AGI readiness.

The signal that accuracy misses
Three independent metrics, one coherent story.
Static calibration (meta-d′): 0.05 → 1.31
Dynamic resilience (Bayesian): 0.74 → 0.91
Calibration gap (ECE): 0.02 → 0.20
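
For the definitions behind these numbers (standard formulations from the metacognition and calibration literature, not suite-specific): the m-ratio normalizes metacognitive sensitivity (meta-d′) by first-order task sensitivity (d′), and ECE averages the per-bin gap between stated confidence and realized accuracy:

\[
\text{m-ratio} = \frac{\text{meta-}d'}{d'},
\qquad
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N}\,\bigl|\operatorname{acc}(S_b) - \operatorname{conf}(S_b)\bigr|
\]

where \(S_b\) is the set of answers whose confidence falls in bin \(b\) of \(B\) equal-width bins, out of \(N\) answers total.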

Benchmark Differentiator

We separate competence from self-monitoring. A model can be right yet remain blind to its own uncertainty. Our benchmark quantifies that gap.

Static Monitoring
Type-2 ROC and meta-d′ to measure metacognitive efficiency: how well confidence separates correct answers from errors.
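
As a concrete illustration of static monitoring, here is a minimal sketch of the type-2 AUROC: the probability that a randomly chosen correct answer carries higher confidence than a randomly chosen error. Names are ours for illustration; the full suite additionally fits meta-d′ via a signal-detection model, which this sketch does not do.

```python
import numpy as np

def type2_auroc(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Area under the type-2 ROC: P(conf on a correct trial > conf on an
    error trial), with ties counted as 0.5."""
    mask = correct.astype(bool)
    pos = confidence[mask]    # confidences on correct trials
    neg = confidence[~mask]   # confidences on error trials
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")   # undefined without both outcomes
    # Mann-Whitney U statistic over all (correct, error) pairs, normalized to [0, 1]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# e.g. type2_auroc(np.array([0.9, 0.6, 0.8, 0.3]), np.array([1, 0, 1, 0])) -> 1.0
```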
Dynamic Evidence Pressure
Belief updating under strong, weak, and neutral evidence.
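
To make "belief updating under evidence" concrete, a minimal Bayesian reference point (illustrative names and values, not our actual scoring pipeline): the normative posterior after evidence with a given likelihood ratio, against which a model's revised confidence can be compared.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior belief after evidence with likelihood ratio
    LR = P(evidence | claim true) / P(evidence | claim false).
    LR > 1 supports the claim, LR < 1 opposes it, LR = 1 is neutral."""
    assert 0.0 < prior < 1.0
    odds = (prior / (1.0 - prior)) * likelihood_ratio
    return odds / (1.0 + odds)

# Illustrative evidence strengths (assumed values, not benchmark parameters)
for label, lr in [("strong against", 0.2), ("weak against", 0.7), ("neutral", 1.0)]:
    print(label, round(bayes_update(prior=0.8, likelihood_ratio=lr), 3))
```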
Bootstrap Reliability
Bootstrap confidence intervals over 5 seeds to validate stability across runs.
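
A minimal sketch of a percentile bootstrap over per-seed scores, assuming one metric score per seed (function and variable names are illustrative): resample the seed-level scores with replacement and read off the interval.

```python
import numpy as np

def bootstrap_ci(seed_scores, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean of per-seed metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(seed_scores, dtype=float)
    # Resample the seeds with replacement; record the mean of each resample
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# e.g. bootstrap_ci([0.71, 0.78, 0.74, 0.80, 0.76]) -> 95% CI for the run mean
```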
Accuracy vs. Metacognition
The capability chasm is visible when m-ratio is plotted against accuracy.


Model Taxonomy
Accuracy alone compresses frontier models into a single cluster.

We observe four archetypes: calibrated leaders, overconfident generalists, resilient-but-gullible systems, and flat monitors.

View full diagnostics
Evidence You Can Trust
Multi-seed stability and multi-turn robustness condense the signal into three headline numbers.

Bootstrap CI: ±0.10 – 0.33 (5-seed stability)
Dynamic Resilience: 0.74 – 0.91 (multi-turn v2)
Calibration Gap: 0.02 – 0.20 (ECE spread)
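
For reference, a minimal equal-width-bin ECE implementation consistent with the definition given above (our suite's binning details may differ; names are illustrative):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins: int = 10) -> float:
    """ECE: bin-size-weighted mean absolute gap between stated confidence
    and realized accuracy, over equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidence)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # include the left edge in the first bin so confidence == 0 is counted
        if i == 0:
            in_bin = (confidence >= lo) & (confidence <= hi)
        else:
            in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return ece
```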

Team & Trajectory

We are building the measurement layer for frontier systems: repeatable, defensible, and governance-ready benchmarks that scale across cognitive faculties.

Phase 1: Metacognitive control benchmarks (live)
Phase 2: Cross-faculty expansions (attention, executive, social)
Phase 3: Swarm reliability + agentic monitoring
Investor Brief
A single benchmark that reveals what accuracy alone cannot.

We quantify metacognitive efficiency with meta-d′, pair it with Bayesian resilience under evidence pressure, and expose the capability chasm in frontier models.

Request Demo