Large language models (LLMs) are increasingly evaluated and used in medicine. Clinical usefulness depends not only on answer accuracy, but also on whether confidence tracks evidence quality and uncertainty. Recent work has argued that LLMs lack essential metacognition for reliable medical reasoning, but metacognition can be operationalized in different ways, including missing-answer recognition, knowledge-gap detection, and confidence sensitivity to evidence and correctness. We developed a controlled, psychophysics-inspired clinical benchmark to test first-order diagnostic choice and second-order confidence behavior in a medical LLM. The benchmark focused on probable Alzheimer-type neurocognitive disorder (AT-NCD) versus depression-related cognitive impairment (DRCI). We generated 45 synthetic vignettes that varied evidence strength, conflicting evidence, and missing information. Each vignette was presented under three prompt variants, yielding 135 trials. In a pilot run with gpt-4.1-nano, all trials produced valid structured outputs. Across forced-choice trials, diagnostic accuracy was 93.5%, mean confidence was 78.4%, and AUROC2 was 0.876. Confidence increased with evidence distance from the diagnostic boundary, decreased in missing-information conditions, and remained higher on correct than on incorrect trials after adjustment for evidence strength and prompt format. These findings indicate partial metacognitive sensitivity rather than globally uninformative confidence. However, confidence was not uniformly reliable. Errors clustered in moderate, conflicting AT-NCD cases, where the model shifted toward DRCI and retained more confidence than empirical accuracy justified. Exploratory comparison across GPT-family models suggested that newer or nominally stronger models did not necessarily show better confidence–correctness discrimination. Thus, medical-LLM confidence should be measured directly rather than inferred from benchmark accuracy or model capability alone. 
This study establishes a reproducible framework for evaluating evidence sensitivity, metacognitive sensitivity, and localized calibration failure in medical LLMs.
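The abstract reports AUROC2 without saying how it was computed. As an illustration only, the sketch below assumes the common type-2 reading: the probability that a randomly chosen correct trial carries higher confidence than a randomly chosen incorrect one, with ties counted as half. The function name `type2_auroc` and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Rank-based (Mann-Whitney) estimate of type-2 AUROC.

    Returns the fraction of (correct, incorrect) trial pairs in which the
    correct trial has the higher confidence; ties contribute 0.5.
    """
    if len(confidence) != len(correct):
        raise ValueError("confidence and correct must have the same length")

    # Split confidence ratings by first-order outcome.
    conf_correct = [c for c, ok in zip(confidence, correct) if ok]
    conf_incorrect = [c for c, ok in zip(confidence, correct) if not ok]
    if not conf_correct or not conf_incorrect:
        raise ValueError("need at least one correct and one incorrect trial")

    # Compare every correct trial against every incorrect trial.
    wins = 0.0
    for c_ok in conf_correct:
        for c_err in conf_incorrect:
            if c_ok > c_err:
                wins += 1.0
            elif c_ok == c_err:
                wins += 0.5
    return wins / (len(conf_correct) * len(conf_incorrect))


# Hypothetical toy data (confidence in percent), NOT results from the study.
toy_confidence = [90, 85, 80, 70, 60, 75]
toy_correct = [True, True, False, True, False, True]
toy_auroc = type2_auroc(toy_confidence, toy_correct)
print(f"toy type-2 AUROC: {toy_auroc:.3f}")
```

A value of 0.5 means confidence carries no information about correctness and 1.0 means perfect separation. Under this reading, the reported 0.876 would mean that a correct trial outranks an incorrect one in confidence in roughly 88% of pairs.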
Ahmad Nazzal
Ahmad Nazzal (Thu,) studied this question.
www.synapsesocial.com/papers/69fed19ab9154b0b82878fe8 — DOI: https://doi.org/10.5281/zenodo.20072971