What question did this study set out to answer?

This research aims to evaluate the metacognitive sensitivity of large language models (LLMs) during medical reasoning.

May 9, 2026 · Open Access

Large Language Models Show Metacognitive Sensitivity in Medical Reasoning

Key Points

The study evaluates the metacognitive sensitivity of large language models (LLMs) during medical reasoning.
It uses a controlled clinical benchmark contrasting probable Alzheimer-type neurocognitive disorder (AT-NCD) with depression-related cognitive impairment (DRCI).
The benchmark comprises 45 synthetic vignettes that vary evidence strength, conflicting evidence, and missing information, each presented under three prompt variants, for 135 trials in total.
A pilot run with gpt-4.1-nano tested diagnostic choice and confidence behavior; all trials produced valid structured outputs.
Across forced-choice trials, diagnostic accuracy was 93.5%, mean confidence was 78.4%, and AUROC2 was 0.876.
Confidence increased as evidence moved farther from the diagnostic boundary and decreased when information was missing.
Errors clustered in moderate, conflicting AT-NCD cases, where the model shifted toward DRCI and retained more confidence than its empirical accuracy justified.
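
The last point describes a localized calibration failure: confidence exceeding accuracy within one condition. A minimal Python sketch of how such a gap could be measured is given here. It is not the study's code; the function name, the condition labels, and the toy trials are all hypothetical, and confidence is assumed to be expressed on a 0 to 1 scale.

```python
from collections import defaultdict


def calibration_gap_by_condition(trials):
    """Return {condition: mean confidence - accuracy}.

    `trials` is an iterable of (condition, confidence, correct) tuples,
    with confidence in [0, 1] and correct as a bool. A positive gap
    means the model was more confident than its accuracy justified.
    """
    # Per condition: [sum of confidences, number correct, number of trials]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for condition, confidence, correct in trials:
        entry = totals[condition]
        entry[0] += confidence
        entry[1] += int(correct)
        entry[2] += 1
    return {
        condition: conf_sum / n - n_correct / n
        for condition, (conf_sum, n_correct, n) in totals.items()
    }


# Toy data for illustration only (not results from the study).
toy_trials = [
    ("clear", 0.9, True),
    ("clear", 0.9, True),
    ("conflicting", 0.8, True),
    ("conflicting", 0.8, False),
]
for condition, gap in calibration_gap_by_condition(toy_trials).items():
    print(f"{condition}: gap = {gap:+.2f}")
```

Computing the gap per condition rather than over all trials is what lets a failure confined to one cell, such as moderate conflicting cases, show up instead of being averaged away.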

Abstract

Large language models (LLMs) are increasingly evaluated and used in medicine. Clinical usefulness depends not only on answer accuracy, but also on whether confidence tracks evidence quality and uncertainty. Recent work has argued that LLMs lack essential metacognition for reliable medical reasoning, but metacognition can be operationalized in different ways, including missing-answer recognition, knowledge-gap detection, and confidence sensitivity to evidence and correctness. We developed a controlled, psychophysics-inspired clinical benchmark to test first-order diagnostic choice and second-order confidence behavior in a medical LLM. The benchmark focused on probable Alzheimer-type neurocognitive disorder (AT-NCD) versus depression-related cognitive impairment (DRCI). We generated 45 synthetic vignettes that varied evidence strength, conflicting evidence, and missing information. Each vignette was presented under three prompt variants, yielding 135 trials. In a pilot run with gpt-4.1-nano, all trials produced valid structured outputs. Across forced-choice trials, diagnostic accuracy was 93.5%, mean confidence was 78.4%, and AUROC2 was 0.876. Confidence increased with evidence distance from the diagnostic boundary, decreased in missing-information conditions, and remained higher on correct than on incorrect trials after adjustment for evidence strength and prompt format. These findings indicate partial metacognitive sensitivity rather than globally uninformative confidence. However, confidence was not uniformly reliable. Errors clustered in moderate, conflicting AT-NCD cases, where the model shifted toward DRCI and retained more confidence than empirical accuracy justified. Exploratory comparison across GPT-family models suggested that newer or nominally stronger models did not necessarily show better confidence–correctness discrimination. Thus, medical-LLM confidence should be measured directly rather than inferred from benchmark accuracy or model capability alone. 
This study establishes a reproducible framework for evaluating evidence sensitivity, metacognitive sensitivity, and localized calibration failure in medical LLMs.
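
The abstract reports AUROC2 without defining it. Assuming it denotes the type-2 AUROC used in the metacognition literature (how well confidence separates correct from incorrect trials), it can be sketched in Python as the probability that a randomly chosen correct trial carries higher confidence than a randomly chosen incorrect one, with ties counted as half. The function name and the toy values are illustrative assumptions, not the study's implementation or data.

```python
def auroc2(confidences, correct):
    """Type-2 AUROC via pairwise comparison (Mann-Whitney formulation).

    `confidences` are numeric confidence ratings and `correct` are bools
    marking whether each trial's diagnosis was right. Returns a value in
    [0, 1]; 0.5 means confidence carries no information about correctness.
    """
    on_correct = [c for c, ok in zip(confidences, correct) if ok]
    on_incorrect = [c for c, ok in zip(confidences, correct) if not ok]
    if not on_correct or not on_incorrect:
        raise ValueError("need at least one correct and one incorrect trial")

    wins = 0.0
    for hi in on_correct:
        for lo in on_incorrect:
            if hi > lo:
                wins += 1.0   # correct trial was held with more confidence
            elif hi == lo:
                wins += 0.5   # ties count as half
    return wins / (len(on_correct) * len(on_incorrect))


# Toy data for illustration only (not results from the study).
toy_confidence = [90, 85, 80, 70, 60, 75]
toy_correct = [True, True, True, False, True, False]
print(auroc2(toy_confidence, toy_correct))  # 0.75
```

Because this measure depends only on the ordering of confidence values, it captures discrimination rather than calibration: a model can score well here and still be overconfident in specific conditions, which is the pattern the abstract describes.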


Cite This Study

Ahmad Nazzal studied this question.

synapsesocial.com/papers/69fed19ab9154b0b82878fe8 https://doi.org/10.5281/zenodo.20072971
