What type of study is this?

This is an experimental study.

What question did this study set out to answer?

This research explores the effects of misleading cues on the accuracy of large language models in clinical inquiries.

February 2, 2026 | Open Access

Impact of authoritative and subjective cues on large language model reliability for clinical inquiries: an experimental study

Yu Chang (Chung Shan Medical University Hospital), Po-Chung Ju, Ming-Hong Hsieh (Chung Shan Medical University Hospital)

Key Points

This research explores the effects of misleading cues on the accuracy of large language models in clinical inquiries.
Five leading large language models (LLMs) answered a single clinical question in an experimental design.
Each model was tested under three prompt conditions: neutral, an incorrect self-recalled memory, and an incorrect statement attributed to an authority.
Accuracy differences were tested with χ² and Cramér’s V, and score shifts were analyzed with van Elteren tests.
All models achieved 100% accuracy under neutral prompts.
Accuracy dropped to 45% with self-recall prompts and to 1% under authoritative prompts.
The prompt–accuracy association was strong (Cramér’s V = 0.75, P < 0.001).
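
For readers unfamiliar with the effect-size measure named above, the following is a minimal Python sketch of how χ² and Cramér’s V can be computed from a table of prompt condition by correct/incorrect counts. The function name `cramers_v` and the counts in `toy_table` are illustrative assumptions, not the study's code or data, and the sketch assumes every row and column contains at least one observation.

```python
import math


def cramers_v(table):
    """Return (chi2, V) for an r x c contingency table given as a list of rows.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Cramér's V rescales chi-square by sample size and the smaller table dimension.
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))


# Hypothetical toy counts (NOT the study's data): rows are prompt conditions,
# columns are [correct, incorrect].
toy_table = [
    [10, 0],  # neutral
    [5, 5],   # self-recalled memory
    [0, 10],  # authoritative statement
]

chi2, v = cramers_v(toy_table)
print(f"chi2 = {chi2:.2f}, Cramér's V = {v:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association), which is why a value of 0.75 is read as a strong link between prompt condition and accuracy.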

Abstract

This study aimed to determine how subjective or authoritative misinformation embedded in user prompts affects large language model (LLM) accuracy on a clinical question with a known gold-standard answer (the treatment line of aripiprazole).

Five leading LLMs answered the clinical question under three prompt conditions: (1) neutral, (2) an incorrect “self-recalled” memory, and (3) an incorrect statement attributed to an authority. Each model–scenario pair was repeated ten times (250 total responses). Accuracy differences were tested with χ² and Cramér’s V, and score shifts were analyzed with van Elteren tests.

All models were correct under the neutral prompt (100% accuracy). Accuracy dropped to 45% with self-recall prompts and to 1% with authoritative prompts, indicating a strong prompt–accuracy association (Cramér’s V = 0.75, P < 0.001). Efficacy and tolerability ratings fell in parallel, yet the models’ self-rated confidence under authoritative prompting stayed high and was statistically indistinguishable from baseline.

LLMs are highly susceptible to misleading cues, especially those invoking authority, while remaining overconfident. These findings call for stronger validation standards, user education, and design safeguards before LLMs are deployed in healthcare.
