We measure what accuracy hides: confidence calibration, error monitoring, and belief updating under pressure. Together, these form a new evaluation axis for AGI readiness.
We separate competence from self-monitoring. A model can be right yet remain blind to its own uncertainty. Our benchmark quantifies that gap.
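To make the gap concrete, here is a minimal sketch using expected calibration error (ECE), a standard calibration measure. It is an illustration only, not necessarily the metric our benchmark reports; the function name, the bin count, and the toy data are all assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, chosen for illustration
) -> float:
    """Weighted mean of |average confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group predictions by stated confidence; 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (hypothetical): two models with identical accuracy (3 of 4 correct).
outcomes = [True, True, True, False]
calibrated = expected_calibration_error([0.75] * 4, outcomes)
overconfident = expected_calibration_error([0.99] * 4, outcomes)
print(f"calibrated ECE: {calibrated:.2f}, overconfident ECE: {overconfident:.2f}")
```

Both toy models score the same on accuracy, yet only the second claims near-certainty on every answer, including the one it gets wrong. That difference is invisible to an accuracy-only leaderboard.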
We are building the measurement layer for frontier systems: repeatable, defensible, and governance-ready benchmarks that can scale across cognitive faculties.