Metacognitive Control · Report v3.2

Metacognitive Coding Safety Benchmark (MCSB)

The MCSB v2 is a 1,030-trial diagnostic suite that isolates an AI model's self-monitoring (metacognition) from its raw accuracy. We quantify cross-tier degradation (Δ accuracy): the gap between baseline competence and adversarial robustness.
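
A minimal sketch of the headline metric, assuming hypothetical trial records (the real trial schema is not specified here): cross-tier degradation is the drop in accuracy between the baseline tier and the adversarial tier.

```python
# Hypothetical trial records; Δ accuracy = baseline accuracy minus
# adversarial accuracy (illustrative data, not benchmark results).

def accuracy(trials):
    """Fraction of trials answered correctly."""
    return sum(t["correct"] for t in trials) / len(trials)

baseline = [{"correct": True}] * 9 + [{"correct": False}]          # 0.90
adversarial = [{"correct": True}] * 6 + [{"correct": False}] * 4   # 0.60

delta_accuracy = accuracy(baseline) - accuracy(adversarial)
print(f"Δ accuracy = {delta_accuracy:.2f}")  # → Δ accuracy = 0.30
```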

Section A: General Metacognition

Baseline capability on standard logic and multi-turn reasoning tasks.

THROUGH THE LOOKING GLASS

Turn 1: Observed Accuracy

The Capability Chasm

[Figure: per-model observed scores plotted against the target line at 1.0.]

Calibration Depth

Static vs. Dynamic monitoring efficiency.

Reliability Diagram
Sensitivity vs Accuracy
M-Ratio Shift
Evidence-driven change in monitoring efficiency across models.

Section B: Code-Security Trustworthiness

MCSB v2 Adversarial Result

This section tests whether belief updates remain directionally correct in a high-stakes domain (code security). Significant cross-tier degradation (Δ accuracy) appears under adversarial evidence pressure.

Sensitivity vs. Adversarial Resilience
X-Axis: Tier 3 Alignment (%) · Y-Axis: Tier 2 Foundational Sensitivity (M-Ratio)
Empirical Comparative Summary
| Model | T2 Sensitivity | T3 Alignment | Trust Score (v2) |
| --- | --- | --- | --- |
| GPT-5.4 | 0.393 | 0.442 | 0.733 |
| Gemini 3.1 Pro | 0.385 | 0.364 | 0.710 |
| Gemini 3 Flash | 1.018 | 0.378 | 0.671 |
| Claude Opus 4.6 | 1.793 | 0.122 | 0.647 |
| Claude Opus 4.7 | 0.858 | 0.264 | 0.640 |
| Gemini 3.1 Flash-Lite | 0.010 | 0.254 | 0.630 |
| Claude Sonnet 4.6 | 1.814 | 0.480 | 0.621 |
| Gemini 2.5 Flash | 0.516 | 0.366 | 0.620 |
| DeepSeek V3.1 | 0.435 | 0.392 | 0.567 |
| DeepSeek V3.2 | 0.000 | 0.418 | 0.543 |

Section C: Adversarial Stress Test (Meta-Evaluation Framework)

Inspired by cognitive evaluation frameworks for measuring robust generalization under distribution shift.

High-fidelity diagnostics revealing the internal representational stability of models. Patterns highlight the sharp transition from foundational logic to adversarial security scenarios.

Section D: Economic Efficiency & Token Economics

Metric Framework v1.0

Metacognition changes the optimal spending policy. High sensitivity enables agents to abort failed reasoning paths early, drastically reducing the Cost of Verified Truth (CVT). This section quantifies the "Metacognitive Dividend."
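
A sketch of the CVT computation under stated assumptions: CVT is the total run cost, converted to cents, divided by the number of verified-correct trials. The figures below are illustrative, not the benchmark's published numbers.

```python
# Cost of Verified Truth: cents spent per correct adversarial trial.
# Inputs are hypothetical (an $0.08 run yielding 615 correct trials).

def cvt_cents(total_cost_usd: float, n_correct: int) -> float:
    """Total run cost in cents, divided by verified-correct trials."""
    return (total_cost_usd * 100) / n_correct

print(f"{cvt_cents(0.08, 615):.3f} ¢ per verified-correct trial")
# → 0.013 ¢ per verified-correct trial
```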

Cost of Verified Truth (CVT)
Expected cost (¢) required to produce one correct adversarial coding trial.
The Efficiency Frontier
X: Log Cost ($/1k trials) | Y: Weighted Trust Score (MCSB v2)
The Monologue Tax & Metacognitive Dividend
Breakdown of token expenditure: Base Prompting vs. Reasoning vs. Correction Overhead.
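
The token-expenditure breakdown above can be sketched as follows; the token counts and the per-1k rate are hypothetical, and the three-way split (base / reasoning / correction) follows the panel's own categories.

```python
# Illustrative "Monologue Tax" breakdown: total token spend split into base
# prompting, chain-of-thought reasoning, and metacognitive correction.

RATE_PER_1K = 0.003  # hypothetical $ per 1k output tokens

def token_cost(base: int, reasoning: int, correction: int) -> dict:
    """Dollar cost per component at a flat per-1k-token rate."""
    price = lambda toks: toks / 1000 * RATE_PER_1K
    return {
        "base": price(base),
        "reasoning (CoT)": price(reasoning),   # the "monologue tax"
        "correction": price(correction),       # metacognitive overhead
        "total": price(base + reasoning + correction),
    }

print(token_cost(400, 2600, 300))
```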

Empirical Cost Summary (1,030 trials)

| Model | Total Cost | CVT (¢ per correct trial) |
| --- | --- | --- |
| DeepSeek V3.2 | $0.08 | 0.013 |
| DeepSeek V3.1 | $0.08 | 0.013 |
| Gemini 3 Flash | $0.18 | 0.026 |
| GPT-5.4 | $1.58 | 0.210 |
| Claude Opus 4.7 | $9.49 | 1.439 |

Audit Registry: pricing_archive.json
*Rates as of April 2026
How to Read the Benchmark
A compact guide to the plots, metrics, and methodology.

This benchmark isolates metacognitive control from raw accuracy by measuring whether a model can calibrate confidence, detect errors, and update beliefs under evidence pressure. We compute signal‑detection metrics (meta‑d′, m‑ratio) alongside multi‑turn resilience scores to separate competence from self‑monitoring.
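
A minimal sketch of the signal-detection quantities named above, using only the standard library. Note that a proper meta-d′ estimate comes from a model fit (Maniscalco & Lau's procedure), so this sketch takes meta-d′ as a given input; the hit/false-alarm rates are illustrative.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # probit (inverse-normal) transform

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Type-1 sensitivity: z(hit rate) - z(false-alarm rate)."""
    return z(hit_rate) - z(fa_rate)

def m_ratio(meta_d: float, hit_rate: float, fa_rate: float) -> float:
    """Metacognitive efficiency: meta-d' / d'. Values near 1.0 mean the
    model's confidence carries nearly all the signal its accuracy does."""
    return meta_d / d_prime(hit_rate, fa_rate)

print(round(d_prime(0.8, 0.2), 3), round(m_ratio(1.2, 0.8, 0.2), 3))
# → 1.683 0.713
```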

The reliability diagram uses trial‑level confidence bins reconstructed from `kbench` logs, ensuring that each plotted point corresponds to a real correctness rate for a given confidence bin. This yields a true calibration curve against the perfect‑calibration diagonal.
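
The binning step can be sketched as below; the field names (`confidence`, `correct`) are illustrative stand-ins, not the actual `kbench` log schema.

```python
# Reliability-diagram computation: bin trials by stated confidence, then
# compare mean confidence to observed accuracy within each bin.

def calibration_bins(trials, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for t in trials:
        # clamp confidence 1.0 into the top bin
        i = min(int(t["confidence"] * n_bins), n_bins - 1)
        bins[i].append(t)
    curve = []
    for b in bins:
        if b:  # skip empty confidence bins
            conf = sum(t["confidence"] for t in b) / len(b)
            acc = sum(t["correct"] for t in b) / len(b)
            curve.append((conf, acc))  # one point on the calibration curve
    return curve

trials = [{"confidence": 0.95, "correct": True},
          {"confidence": 0.92, "correct": False},
          {"confidence": 0.55, "correct": True}]
print(calibration_bins(trials))
```

Points below the diagonal (accuracy lower than mean confidence, as in the top bin here) indicate overconfidence.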

Reading the Plots

  • Accuracy vs. M‑Ratio: High accuracy + high m‑ratio indicates a competent model with efficient self‑monitoring.
  • Calibration Curve: Deviations below the diagonal indicate overconfidence (miscalibration).
  • M‑Ratio Shift: Large negative deltas signal susceptibility to evidence pressure.
  • Quadrant Chart: Resilience vs. sensitivity separates stable leaders from brittle models.
  • Degradation Gap (Panel A): Quantifies cross-tier degradation (Δ accuracy): the gap between baseline competence and adversarial robustness.
  • Alignment Failure (Panel C): Alignment is quantified via response consistency under perturbation, decomposed into underreaction (invariance to critical changes) and overreaction (sensitivity to irrelevant perturbations).
  • Confidence Shift (Δ): Positive Δ indicates increased confidence under adversarial perturbation, suggesting miscalibrated belief updates or unstable internal representations.
  • CVT Comparison (Panel D): Measures the expected cost required to extract one point of verified truth under adversarial conditions. Scaled in cents (¢) for intuitive economic evaluation.
  • Efficiency Frontier (Panel D): A Pareto analysis of Weighted Trust Score vs. Log‑Cost ($ per 1k trials) identifying optimal ROI leaders.
  • Monologue Tax (Panel D): Breaks down token costs into base, reasoning (CoT), and metacognitive correction components.
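
The Panel-D frontier reduces to a Pareto check, sketched here with illustrative scores: a model sits on the frontier when no other model is at least as cheap and at least as trusted, and strictly better on one axis.

```python
# Pareto frontier over (cost, trust) points; data is hypothetical.

def pareto_frontier(models):
    """models: list of (name, cost_per_1k_usd, trust_score) tuples."""
    frontier = []
    for name, cost, trust in models:
        dominated = any(c <= cost and t >= trust and (c < cost or t > trust)
                        for _, c, t in models)
        if not dominated:
            frontier.append(name)
    return frontier

models = [("A", 0.08, 0.54), ("B", 1.58, 0.73), ("C", 9.49, 0.64)]
print(pareto_frontier(models))  # → ['A', 'B']  (C pays more for less trust)
```
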
References: Burnell et al. (2026); Fleming & Lau (2014).
Read full Kaggle write‑up

Frequently Asked Questions

Quick reference for AI agents and research auditors.