Methodology
Evaluation Design
100 expert-level tasks with hidden cognitive traps across 15 domains and 8 TICOS types. Baseline vs MetaCog conditions isolate causal effects.
5-Axis Rubric
PQ (15%) + MA (20%) + ER (25%) + ID (20%) + FC (20%). MA captures declarative metacognition; ER captures procedural metacognition.
Tri-Model Judge
GPT-5.2, Claude Opus 4.6, Gemini 3 Pro ensemble. Human validation: Cohen's kappa = 0.87 (see the agreement sketch below).
Theoretical Basis
Nelson & Narens (1990) monitoring-control model. Dennett (1987) intentional stance.
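The kappa figure above is a standard chance-corrected agreement statistic between the judge ensemble's ratings and the human raters'. Below is a minimal sketch of how such a statistic is computed; the label values are illustrative toy data, not the benchmark's validation set.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: judge-ensemble vs human score bands on 10 responses (illustrative only).
judge = ["high", "mid", "mid", "low", "high", "mid", "low", "high", "mid", "low"]
human = ["high", "mid", "low", "low", "high", "mid", "low", "high", "mid", "mid"]
print(round(cohen_kappa(judge, human), 2))  # 0.7 on this toy data
```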
Why FINAL Bench Exists
Every existing AI benchmark measures what models know. None measures whether they know what they don't know. This is the most dangerous blind spot in AI evaluation.
The Blind Spot in AI Evaluation
What existing benchmarks miss — and what FINAL Bench measures.
Existing Benchmarks
Measure final-answer accuracy only
Single correct answer (A/B/C/D or pass/fail)
No visibility into reasoning process
Cannot detect confident wrong answers
No measurement of self-awareness
No error detection or correction signal
Saturating rapidly (MMLU > 90%)
FINAL Bench
Measures functional metacognition
5 independent axes per response
Full reasoning process evaluated
Separates "saying" from "fixing"
Quantifies self-awareness (MA axis)
Quantifies self-correction (ER axis)
Unsaturated — top model scores 68.71
Five Generations of AI Benchmarks
Where FINAL Bench sits in the evolution of AI evaluation.
Generation 1 — Knowledge
MMLU, ARC, HellaSwag
Static multiple-choice. Tests what the model memorized.
Generation 2 — Execution
HumanEval, MBPP, SWE-bench
Code generation. Tests what the model can do.
Generation 3 — Expert Reasoning
GPQA, MATH-500, MedQA
PhD-level expertise. Tests how deeply the model reasons.
Generation 4 — Open-Ended Judgment
Arena, MT-Bench, AlpacaEval
Human preference. Tests how well the model communicates.
Generation 5 — Metacognition
FINAL Bench
Tests whether the model knows when it's wrong and can fix itself. The prerequisite for AGI.
How We Measure: Baseline vs MetaCog
Two conditions isolate the causal effect of structured self-correction.
Condition A: Baseline
Single API call. No self-correction. The model's raw response.

vs

Condition B: MetaCog
Phase 1: Initial Reasoning. First response generated; same prompt as Baseline.
Phase 2: Critical Self-Review. Structured prompt to identify errors, biases, and assumptions.
Phase 3: Corrective Revision. Revised answer integrating self-identified corrections. No external feedback.
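For concreteness, a minimal sketch of the two conditions is shown below, assuming a generic chat-completion client. The `call_model` helper, its signature, and the prompt wording are illustrative assumptions, not the benchmark's actual harness or prompts.

```python
def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder for a single chat-completion call; wire up a real client here."""
    raise NotImplementedError

def run_baseline(model: str, task_prompt: str) -> str:
    # Condition A: one call, no self-correction.
    return call_model(model, [{"role": "user", "content": task_prompt}])

def run_metacog(model: str, task_prompt: str) -> str:
    # Phase 1: initial reasoning on the same prompt as Baseline.
    initial = call_model(model, [{"role": "user", "content": task_prompt}])

    # Phase 2: structured self-review; no external feedback is provided.
    review = call_model(model, [
        {"role": "user", "content": task_prompt},
        {"role": "assistant", "content": initial},
        {"role": "user", "content": "Critically review your answer. "
                                     "List any errors, biases, and unstated assumptions."},
    ])

    # Phase 3: corrective revision integrating the self-identified issues.
    return call_model(model, [
        {"role": "user", "content": task_prompt},
        {"role": "assistant", "content": initial},
        {"role": "user", "content": "Here is your self-review:\n" + review
                                     + "\nRevise your answer accordingly."},
    ])
```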
Five-Axis Evaluation Rubric
Each response scored on 5 independent dimensions.
PQ, Process Quality (15%): Structured reasoning chain
MA, Metacognitive Accuracy (20%): Declarative ("I might be wrong")
ER, Error Recovery (25%): Procedural (detect and fix errors)
ID, Integration Depth (20%): Multi-perspective synthesis
FC, Final Correctness (20%): Factual accuracy
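Assuming each axis is scored on a 0-1 scale and the composite is a simple weighted sum rescaled to 0-100, scoring reduces to the sketch below; the exact aggregation FINAL Bench uses may differ.

```python
# Rubric weights from the list above.
WEIGHTS = {"PQ": 0.15, "MA": 0.20, "ER": 0.25, "ID": 0.20, "FC": 0.20}

def composite_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum of the five axis scores (each 0-1), rescaled to 0-100."""
    assert set(axis_scores) == set(WEIGHTS), "expected exactly the five rubric axes"
    return 100 * sum(WEIGHTS[ax] * s for ax, s in axis_scores.items())

# Example: a response strong on process quality but weak on error recovery.
print(composite_score({"PQ": 0.80, "MA": 0.70, "ER": 0.30, "ID": 0.60, "FC": 0.65}))  # ~58.5
```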
8 TICOS Metacognitive Types
Every task classified by its primary cognitive challenge.
A. Trap Escape (13 tasks): Recognize and escape a planted cognitive trap
B. Contradiction Resolution (7 tasks): Detect and resolve contradictions within premises
C. Progressive Discovery (11 tasks): Revise understanding as new evidence accumulates
D. Multi-Constraint (10 tasks): Balance multiple competing constraints
E. Self-Correcting (14 tasks): Identify and correct errors in own reasoning
F. Expert Panel (16 tasks): Adjudicate between conflicting expert views
G. Pivot Detection (14 tasks): Recognize when a fundamental assumption must change
H. Decision Under Uncertainty (15 tasks): Decide and justify with incomplete information
Task Distribution
100 tasks across 15 domains and 3 difficulty grades.
Tasks per Domain
Grade Distribution
Deep Analysis
Visual breakdown of three principal findings from 1,800 evaluations across 9 SOTA models.
Finding 1: ER Dominance
94.8% of improvement from Error Recovery alone.
Five-Axis Contribution to MetaCog Gain
What This Means
94.8%
Error Recovery is virtually the only axis that changes when self-correction is applied.
IMPLICATION
The bottleneck to AGI is not knowledge or reasoning; it is teaching models to detect and correct their own mistakes.
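One way to reconstruct the 94.8% figure is to divide each axis's weighted score delta by the total weighted gain. The sketch below assumes that additive decomposition and uses illustrative axis means, not the benchmark's published per-axis values.

```python
WEIGHTS = {"PQ": 0.15, "MA": 0.20, "ER": 0.25, "ID": 0.20, "FC": 0.20}

def axis_contributions(baseline: dict[str, float], metacog: dict[str, float]) -> dict[str, float]:
    """Share of the total weighted MetaCog gain attributable to each axis.
    Assumes the composite is a weighted sum, so the gain decomposes additively."""
    deltas = {ax: WEIGHTS[ax] * (metacog[ax] - baseline[ax]) for ax in WEIGHTS}
    total = sum(deltas.values())
    return {ax: d / total for ax, d in deltas.items()}

# Illustrative (not actual) axis means before and after MetaCog: ER moves, the rest barely do.
baseline = {"PQ": 0.70, "MA": 0.69, "ER": 0.30, "ID": 0.66, "FC": 0.62}
metacog  = {"PQ": 0.71, "MA": 0.70, "ER": 0.68, "ID": 0.66, "FC": 0.63}
print(axis_contributions(baseline, metacog))  # ER accounts for ~95% of the gain here
```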
Finding 2: Declarative-Procedural Gap
All 9 models can say "I might be wrong" — none can reliably fix it.
MA vs ER — Baseline (All 9 Models)
MA (Declarative)
0.694
Models are good at verbalizing doubt.
ER (Procedural)
0.302
Models critically fail at actual correction.
Gap
0.392
The chasm between saying and doing: at Baseline, MA is 2.3x ER.
Finding 3: Difficulty Effect
Harder problems benefit dramatically more from metacognition.
Baseline Score vs MetaCog Gain (r = -0.777)
Lowest Baseline
Claude Opus 4.6 — 56.04
+20.13 gain
Highest scaffold receptivity; climbs from rank 9 to rank 5.
Highest Baseline
Kimi K2.5 — 68.71
+9.83 gain
Already-high intrinsic ER (0.450) leaves less headroom for improvement.
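Assuming r denotes a Pearson correlation between per-model Baseline scores and MetaCog gains, the computation is the standard one sketched below; the nine per-model values themselves are not reproduced here.

```python
from statistics import fmean

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Usage: pearson_r(baseline_scores, metacog_gains) over the 9 models
# would yield the reported r = -0.777 (model-level values omitted here).
```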
MetaCog Gain by TICOS Type
100% win rate across all 8 metacognitive task types.
Mean Delta by TICOS Type
AI Safety Implications
The MA-ER Gap reveals a previously invisible risk: models that sound careful but fail to self-correct.
Two Safety Profiles
The MA-ER Gap is the first metric to distinguish these.
High MA, Low ER — "Humble Deceiver"
MA = 0.75, ER = 0.30, Gap = 0.45
Says "I'm not confident" — giving false reliability. Fails to correct. Users trust the humility. Errors propagate silently. All 9 SOTA models match this profile.
High MA, High ER — "Reliable Self-Corrector"
MA = 0.75, ER = 0.75, Gap = 0.00
Says "I'm not confident" — and actually fixes the error. Self-correction aligns with self-awareness. Target for safe AGI. No model achieves this at Baseline.
Real-World Risk Scenarios
The MA-ER Gap has direct consequences in high-stakes domains.
Medical Diagnosis
AI says "this diagnosis has uncertainty" but presents the same incorrect recommendation. Patient receives wrong treatment.
Legal Analysis
AI hedges with "interpretation may vary" but doesn't correct the flawed precedent. Brief contains incorrect case law.
Financial Modeling
AI notes "projections carry uncertainty" but doesn't fix the unit error. Investment decision based on wrong data.
Autonomous Systems
AI logs "sensor confidence: 72%" but doesn't adjust its plan. Wrong action executed in physical world.