What question did this study set out to answer?

This study aims to evaluate large language models' ability to calibrate confidence and resist sycophancy in maternal emergency diagnostics.

May 9, 2026 | Open Access

Metacognition Benchmark: Evaluating Confidence Calibration and Sycophancy Resistance in Clinical AI

Key Points

MEMB is a two-task evaluation framework focused on metacognition in clinical AI models.
Task 1 assesses diagnostic accuracy and confidence calibration across 130 cases using a composite score.
Task 2 measures sycophancy resistance across 30 adversarial cases with purpose-designed metrics.
All six models are 1.27x to 1.82x overconfident relative to their actual accuracy.
Sycophancy rates are zero across all models, while false defiance rates reach 40% for three models.
DeepSeek V3.2 achieves the highest combined benchmark score of 76.89 and best resistance score of 80.00.

Abstract

MEMB (Maternal Emergency Metacognition Benchmark) is a two-task evaluation framework designed to measure metacognition, defined here as a model's ability to calibrate its own confidence, in large language models applied to maternal emergency diagnostics. Task 1 evaluates diagnostic accuracy jointly with confidence calibration across 130 clinical cases spanning rare and common obstetric emergencies, using a composite score weighted 70% toward calibration (Brier Score) and 30% toward Expected Calibration Error (ECE). Task 2 evaluates sycophancy resistance, that is, whether models maintain clinical accuracy when confronted with confident but incorrect human assertions, across 30 adversarial cases using 8 purpose-designed metrics including Trust Alignment Score (TAS), Sycophancy Rate (SR), False Defiance Rate (FDR), and Confidence Gap (CG). Six frontier models were evaluated: Claude Sonnet 4.6, DeepSeek V3.2, GPT-5.4, Gemini 3 Flash Preview, Qwen 3 Coder 480B, and Claude Opus 4.6. Key findings: (1) all models are 1.27x to 1.82x overconfident relative to actual accuracy; (2) sycophancy rates are zero across all models, but false defiance rates reach 40% for three models; (3) DeepSeek V3.2 achieves the highest combined benchmark score (76.89) and the best resistance score (80.00); (4) Gemini 3 Flash Preview exhibits the most dangerous calibration profile, averaging 96.56% confidence with only 56.92% accuracy, a gap of 39.64 percentage points. These findings demonstrate that raw diagnostic accuracy is insufficient for safe medical AI deployment: confidence calibration and resistance to incorrect authority must be evaluated jointly. A supplementary finding documents that MedGemma 1.5 4B without fine-tuning is unsuitable for emergency triage. It produces unstructured conversational outputs that cannot support time-critical clinical decision-making, which points to a critical gap between general medical pretraining and specialist deployment readiness.
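For readers unfamiliar with the two calibration measures named in the abstract, the following is a minimal sketch of their standard textbook definitions. It is not the benchmark's own scoring code: the function names, the choice of 10 equal-width bins, and the `overconfidence_gap` helper are assumptions made for illustration. The helper is a simple subtraction and is not the paper's Confidence Gap (CG) metric. The abstract does not say how the Brier Score and ECE are normalized before being combined, so the composite score is not reproduced here.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared difference between stated confidence (0..1) and the 0/1 outcome.

    Lower is better; 0.0 means every prediction was fully confident and right.
    """
    pairs = list(zip(confidences, correct))
    return sum((c - float(y)) ** 2 for c, y in pairs) / len(pairs)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: the source does not state a bin count
) -> float:
    """Size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, float(y)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def overconfidence_gap(mean_confidence_pct: float, accuracy_pct: float) -> float:
    """Hypothetical helper: mean confidence minus accuracy, in percentage points."""
    return mean_confidence_pct - accuracy_pct


# Figures reported in the abstract for Gemini 3 Flash Preview.
gemini_gap = round(overconfidence_gap(96.56, 56.92), 2)
```

With the reported figures, `gemini_gap` evaluates to 39.64, matching the gap stated in the abstract. A model that answers two cases at 90% confidence and gets only one right would score a Brier Score of 0.41 and an ECE of 0.4 under these definitions.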
