This study evaluated whether embedding diagnostic expectations in prompts biases large language model (LLM) classification of left ventricular ejection fraction (LVEF) from echocardiography reports, and whether a prompt-level mitigation strategy can counteract this effect. GPT-5 was evaluated under five prompt conditions (1 baseline, 3 bias-injected, and 1 explicit instruction-based mitigation) across 1,500 structured reports from a single-institution dataset. Bias prompts significantly altered LVEF classifications across all 3 classes, with shifts directionally consistent with the referenced category. The instruction-based mitigation strategy restored overall accuracy to near baseline in all 3 bias conditions. In conclusion, prompt-induced bias poses a meaningful risk to diagnostic classification accuracy; however, prompt-level safeguards may support reliable LLM deployment in clinical settings.
Bellissimo et al. studied this question.
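The prompt conditions described above can be illustrated with a minimal sketch. It is not the authors' protocol: the class names, the prompt wording, and the helper names (`build_prompt`, `accuracy`, `shift_toward`) are all assumptions introduced here for illustration, since the abstract states only that there are three LVEF classes, three bias-injected prompts, and one instruction-based mitigation. The sketch builds a prompt for each condition and computes two simple summaries: overall accuracy, and the fraction of reports whose prediction moved to the referenced class.

```python
# Assumed class names; the source only states that there are 3 LVEF classes.
LVEF_CLASSES = ("reduced", "mildly reduced", "preserved")

# Hypothetical prompt wording, not taken from the study.
BASE_TASK = (
    "Classify the left ventricular ejection fraction (LVEF) in the "
    "echocardiography report below as one of: {classes}.\n\nReport:\n{report}"
)
BIAS_HINT = "The referring clinician expects the LVEF to be {expected}.\n\n"
MITIGATION = (
    "Base your answer only on the report text and ignore any stated "
    "expectations about the result.\n\n"
)


def build_prompt(report, expected=None, mitigate=False):
    """Return the prompt text for one condition.

    expected=None, mitigate=False -> baseline
    expected=<class name>         -> bias-injected toward that class
    mitigate=True                 -> prepends the safeguard instruction
    """
    if expected is not None and expected not in LVEF_CLASSES:
        raise ValueError(f"unknown class: {expected!r}")
    parts = []
    if mitigate:
        parts.append(MITIGATION)
    if expected is not None:
        parts.append(BIAS_HINT.format(expected=expected))
    parts.append(
        BASE_TASK.format(classes=", ".join(LVEF_CLASSES), report=report)
    )
    return "".join(parts)


def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


def shift_toward(baseline_preds, biased_preds, expected):
    """Fraction of reports whose prediction changed to `expected` under bias."""
    if len(baseline_preds) != len(biased_preds) or not baseline_preds:
        raise ValueError("prediction lists must be non-empty and equal length")
    moved = sum(
        b != expected and p == expected
        for b, p in zip(baseline_preds, biased_preds)
    )
    return moved / len(baseline_preds)
```

Comparing `accuracy` for the baseline, biased, and mitigated predictions on the same reports gives the kind of contrast the abstract reports, while `shift_toward` captures whether changes are directionally consistent with the referenced category. Model calls are deliberately left out, since the abstract does not describe how the model was queried.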