What question did this study set out to answer?

The study aims to assess how well self-reported confidence of large language models aligns with their accuracy in answering medical questions.

June 28, 2026 · Open Access

Calibration of Self-Reported Confidence and Accuracy of Large Language Models in Medical Question Answering

Key Points

The study aims to assess how well self-reported confidence of large language models aligns with their accuracy in answering medical questions.
Six large language models were prompted to answer medical multiple-choice questions from MedMCQA and to report confidence scores.
Expected Calibration Error (ECE) was calculated from 12,000 responses by averaging the absolute differences between observed accuracy and predicted confidence.
Mean ECE varied by model, from 0.06 for Claude Sonnet 4.5 (best) to 0.127 for GPT-4o (worst).
Accuracy and ECE both varied across medical specialties; ECE ranged from 0.041 for “Skin” (best) to 0.141 for “Social & Preventive Medicine” (worst).
High accuracy does not guarantee reliable confidence estimates, and performance is heterogeneous across medical fields.

Abstract

Objective: To evaluate the alignment between the self-reported confidence of large language models (LLMs) and their accuracy in answering medical multiple-choice questions.

Materials and Methods: We prompted six LLMs (GPT-5, GPT-5-mini, GPT-5-nano, GPT-4o, Claude Sonnet 4.5, and Gemini 2.5) to answer MedMCQA items and report confidence scores. Based on 12,000 LLM responses, we calculated the Expected Calibration Error (ECE) by averaging absolute differences between observed accuracy and predicted confidence.

Results: Mean ECE differed by model (Claude Sonnet 4.5 best: 0.06; GPT-4o worst: 0.127) and varied across specialties (“Skin” best: 0.041; “Social & Preventive Medicine” worst: 0.141). The accuracy of the examined LLMs showed analogous variation between specialties.

Discussion: Our results demonstrate that high accuracy does not guarantee reliable uncertainty estimation. We identified substantial heterogeneity across medical specialties, where pooled metrics masked a more than threefold difference in ECE between the best- and worst-performing domains.

Conclusion: We recommend incorporating calibration reporting into LLM evaluations. Larger models exhibit improved “self-knowledge”, but uneven overconfidence persists.
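
To make the metric concrete, the following is a minimal sketch of a binned ECE computation. It is not the authors' code. It assumes that confidence scores are expressed on a 0 to 1 scale and grouped into ten equal-width bins, and that the per-bin gaps are weighted by bin size; the abstract does not state the binning scheme, and the function name `expected_calibration_error` is hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: the source does not state the bin count
) -> float:
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one response is required")

    # Assign each response to an equal-width confidence bin on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data, not taken from the study: four answers at 90% stated confidence,
    # three of which were correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Under these assumptions, a model that states 90% confidence but is right 75% of the time in that bin contributes a gap of about 0.15, which is the kind of overconfidence the reported ECE values summarize. Computing the same quantity separately per specialty would expose the heterogeneity that a pooled value hides.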
