This study examined how subjective or authoritative misinformation embedded in user prompts affects large language model (LLM) accuracy on a clinical question with a known gold-standard answer (the treatment line of aripiprazole). Five leading LLMs answered the clinical question under three prompt conditions: (1) neutral, (2) an incorrect “self-recalled” memory, and (3) an incorrect statement attributed to an authority. Each model–scenario pair was repeated ten times (250 total responses). Accuracy differences were tested with χ² tests and quantified with Cramér’s V, and score shifts were analyzed with van Elteren tests. All models were correct under the neutral prompt (100% accuracy). Accuracy dropped to 45% with self-recall prompts and to 1% with authoritative prompts, indicating a strong prompt–accuracy association (Cramér’s V = 0.75, P < 0.001). Efficacy and tolerability ratings fell in parallel, yet the models’ self-rated confidence under authoritative prompting stayed high and was statistically indistinguishable from baseline. LLMs are thus highly susceptible to misleading cues, especially those invoking authority, while remaining overconfident. These findings call for stronger validation standards, user education, and design safeguards before LLMs are deployed in healthcare.
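As a rough illustration of the effect-size statistic named in the abstract, the sketch below computes χ² and Cramér’s V for a condition-by-correctness table in plain Python. The counts are hypothetical: they were chosen only to match the reported accuracies (100%, 45%, 1%) and the 250-response total, and the per-condition sample sizes (50, 100, 100) are an assumption that the abstract does not state. The function names are likewise illustrative and are not taken from the paper.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and total count for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))


# HYPOTHETICAL counts (assumed, not from the paper): [correct, incorrect]
counts = [
    [50, 0],    # neutral prompt, assumed 50 responses
    [45, 55],   # self-recall prompt, assumed 100 responses
    [1, 99],    # authoritative prompt, assumed 100 responses
]

v = cramers_v(counts)
print(f"Cramér's V = {v:.2f}")
```

Under these assumed counts the statistic comes out close to the reported value of 0.75, but that agreement depends entirely on the assumed split of responses across conditions.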
Yu Chang
Po-Chung Ju
Ming-Hong Hsieh
Scientific Reports
Chung Shan Medical University
Chung Shan Medical University Hospital
Chang et al. studied this question.
www.synapsesocial.com/papers/69800910aa6434d8c2036cbb — DOI: https://doi.org/10.1038/s41598-026-38019-3