MEMB (Maternal Emergency Metacognition Benchmark) is a two-task evaluation framework designed to measure metacognition, defined here as a model's ability to calibrate its own confidence, in large language models applied to maternal emergency diagnostics. Task 1 evaluates diagnostic accuracy jointly with confidence calibration across 130 clinical cases spanning rare and common obstetric emergencies, using a composite score weighted 70% toward the Brier Score and 30% toward Expected Calibration Error (ECE). Task 2 evaluates sycophancy resistance, that is, whether models maintain clinical accuracy when confronted with confident but incorrect human assertions, across 30 adversarial cases using 8 purpose-designed metrics, including Trust Alignment Score (TAS), Sycophancy Rate (SR), False Defiance Rate (FDR), and Confidence Gap (CG). Six frontier models were evaluated: Claude Sonnet 4.6, DeepSeek V3.2, GPT-5.4, Gemini 3 Flash Preview, Qwen 3 Coder 480B, and Claude Opus 4.6. Key findings: (1) all models are overconfident by a factor of 1.27x to 1.82x relative to their actual accuracy; (2) sycophancy rates are zero across all models, but false defiance rates reach 40% for three models; (3) DeepSeek V3.2 achieves the highest combined benchmark score (76.89) and the best resistance score (80.00); (4) Gemini 3 Flash Preview exhibits the most dangerous calibration profile, averaging 96.56% confidence with only 56.92% accuracy, a gap of 39.64 percentage points. These findings demonstrate that raw diagnostic accuracy is insufficient for safe medical AI deployment; confidence calibration and resistance to incorrect authority must be evaluated jointly. A supplementary finding documents that MedGemma 1.5 4B without fine-tuning is unsuitable for emergency triage: it produces unstructured conversational outputs that cannot support time-critical clinical decision-making, which points to a critical gap between general medical pretraining and specialist deployment readiness.
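The calibration quantities named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of the Brier Score and of equal-width-bin ECE; the abstract does not state the benchmark's binning, normalization, or the exact formula behind the 70/30 composite, so the composite is not reproduced here. The function names, the default of 10 bins, and the reading of "overconfidence" as mean confidence divided by accuracy are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared difference between stated confidence (0..1) and the 0/1 outcome."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    The bin count is an assumption; the abstract does not specify one.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, float(y)))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def confidence_gap(mean_confidence_pct: float, accuracy_pct: float) -> float:
    """Gap in percentage points between mean stated confidence and accuracy."""
    return mean_confidence_pct - accuracy_pct


def overconfidence_ratio(mean_confidence_pct: float, accuracy_pct: float) -> float:
    """Assumed definition: mean confidence divided by accuracy (not confirmed by the source)."""
    return mean_confidence_pct / accuracy_pct
```

Applied to the Gemini 3 Flash Preview figures quoted above, `confidence_gap(96.56, 56.92)` reproduces the reported 39.64 percentage points. Under the assumed ratio definition, `overconfidence_ratio(96.56, 56.92)` comes to roughly 1.70, which falls inside the reported 1.27x to 1.82x range, although the benchmark may compute its overconfidence factor differently.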
Nabeera Khan
Virtual University of Pakistan
University of Pangasinan
Nabeera Khan studied this question.
www.synapsesocial.com/papers/69fed008b9154b0b82877091 — DOI: https://doi.org/10.5281/zenodo.20067056