Metacognitive Benchmark Suite · v3

Metacognitive Control for Frontier Models.

We measure what accuracy hides: confidence calibration, error monitoring, and belief updating under pressure. A new evaluation axis for AGI readiness.

The signal that accuracy misses
Three independent metrics, one coherent story.
Static calibration (meta-d′): 0.05 → 1.31
Dynamic resilience (Bayesian): 0.74 → 0.91
Calibration gap (ECE): 0.02 → 0.20
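
For the definitions behind these numbers (standard formulations from the metacognition and calibration literature, not suite-specific): the m-ratio normalizes metacognitive sensitivity (meta-d′) by first-order task sensitivity (d′), and ECE averages the per-bin gap between stated confidence and realized accuracy:

\[
\text{m-ratio} = \frac{\text{meta-}d'}{d'},
\qquad
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N}\,\bigl|\operatorname{acc}(S_b) - \operatorname{conf}(S_b)\bigr|
\]

where \(S_b\) is the set of answers whose confidence falls in bin \(b\) of \(B\) equal-width bins, out of \(N\) answers total.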

Benchmark Differentiator

We separate competence from self-monitoring. A model can be right yet remain blind to its own uncertainty. Our benchmark quantifies that gap.

Static Monitoring
Type-2 ROC and meta-d′ to measure metacognitive efficiency: how well confidence separates correct answers from errors.
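
As a concrete illustration of static monitoring, here is a minimal sketch of the type-2 AUROC: the probability that a randomly chosen correct answer carries higher confidence than a randomly chosen error. Names are ours for illustration; the full suite additionally fits meta-d′ via a signal-detection model, which this sketch does not do.

```python
import numpy as np

def type2_auroc(confidence: np.ndarray, correct: np.ndarray) -> float:
    """Area under the type-2 ROC: P(conf on a correct trial > conf on an
    error trial), with ties counted as 0.5."""
    mask = correct.astype(bool)
    pos = confidence[mask]    # confidences on correct trials
    neg = confidence[~mask]   # confidences on error trials
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")   # undefined without both outcomes
    # Mann-Whitney U statistic over all (correct, error) pairs, normalized to [0, 1]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# e.g. type2_auroc(np.array([0.9, 0.6, 0.8, 0.3]), np.array([1, 0, 1, 0])) -> 1.0
```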
Dynamic Evidence Pressure
Belief updating under strong, weak, and neutral evidence.
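
To make "belief updating under evidence" concrete, a minimal Bayesian reference point (illustrative names and values, not our actual scoring pipeline): the normative posterior after evidence with a given likelihood ratio, against which a model's revised confidence can be compared.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior belief after evidence with likelihood ratio
    LR = P(evidence | claim true) / P(evidence | claim false).
    LR > 1 supports the claim, LR < 1 opposes it, LR = 1 is neutral."""
    assert 0.0 < prior < 1.0
    odds = (prior / (1.0 - prior)) * likelihood_ratio
    return odds / (1.0 + odds)

# Illustrative evidence strengths (assumed values, not benchmark parameters)
for label, lr in [("strong against", 0.2), ("weak against", 0.7), ("neutral", 1.0)]:
    print(label, round(bayes_update(prior=0.8, likelihood_ratio=lr), 3))
```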
Bootstrap Reliability
Bootstrap confidence intervals over 5 seeds to validate stability across runs.
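
A minimal sketch of a percentile bootstrap over per-seed scores, assuming one metric score per seed (function and variable names are illustrative): resample the seed-level scores with replacement and read off the interval.

```python
import numpy as np

def bootstrap_ci(seed_scores, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean of per-seed metric scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(seed_scores, dtype=float)
    # Resample the seeds with replacement; record the mean of each resample
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# e.g. bootstrap_ci([0.71, 0.78, 0.74, 0.80, 0.76]) -> 95% CI for the run mean
```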
Accuracy vs. Metacognition
The capability chasm is visible when m-ratio is plotted against accuracy.


Model Taxonomy
Accuracy alone compresses frontier models into a single cluster.

We observe four archetypes: calibrated leaders, overconfident generalists, resilient-but-gullible systems, and flat monitors.

View full diagnostics
Evidence You Can Trust
Multi-seed stability and multi-turn robustness condense the signal into three headline numbers.

Bootstrap CI: ±0.10 – 0.33 (5-seed stability)
Dynamic Resilience: 0.74 – 0.91 (multi-turn v2)
Calibration Gap: 0.02 – 0.20 (ECE spread)
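
For reference, a minimal equal-width-bin ECE implementation consistent with the definition given above (our suite's binning details may differ; names are illustrative):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins: int = 10) -> float:
    """ECE: bin-size-weighted mean absolute gap between stated confidence
    and realized accuracy, over equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidence)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # include the left edge in the first bin so confidence == 0 is counted
        if i == 0:
            in_bin = (confidence >= lo) & (confidence <= hi)
        else:
            in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return ece
```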

Team & Trajectory

We are building the measurement layer for frontier systems: repeatable, defensible, and governance-ready benchmarks that scale across cognitive faculties.

Phase 1: Metacognitive control benchmarks (live)
Phase 2: Cross-faculty expansions (attention, executive, social)
Phase 3: Swarm reliability + agentic monitoring
Investor Brief
A single benchmark that reveals what accuracy alone cannot.

We quantify metacognitive efficiency with meta-d′, pair it with Bayesian resilience under evidence pressure, and expose the capability chasm in frontier models.

Request Demo