What question did this study set out to answer?

This research aims to evaluate how the language of input affects the diagnostic reasoning capabilities of large language models in clinical scenarios.

May 27, 2026 | Open Access

Language-Specific Differences in Large Language Model Diagnostic Reasoning: A Translation-Controlled Clinical Vignette Study

Key Points

Translation-controlled in silico study using 30 real-patient clinical vignettes.
Paired English and Polish conditions with back-translated prompts.
720 evaluations from 360 unique model–language–vignette responses scored by physician raters.
Four of the six models (Qwen2.5, Llama3.3, Meditron3, and OpenBioLLM) performed significantly better in English, with the largest gap observed for Qwen2.5.
GPT-5 and Bielik showed no statistically detectable English-Polish difference in total score in this sample.
Language-related differences were most evident in differential diagnosis quality, justification, and examination planning rather than in final diagnosis alone.

Abstract

Background: Large language models (LLMs) are increasingly being evaluated for clinically relevant diagnostic tasks, yet their performance may vary across languages. We aimed to determine whether input language influences LLM diagnostic reasoning in vignette-based clinical tasks and to inform multilingual predeployment evaluation for non-English healthcare systems.

Methods: In this translation-controlled in silico study, 30 real-patient clinical vignettes were presented in paired English- and Polish-language conditions using back-translated prompts and cases. Six LLMs were evaluated with a structured reflection framework adapted from medical education. The study included 720 rater-level evaluations and 360 unique model–language–vignette responses. Responses were independently scored by 2 physician raters, with major discrepancies adjudicated by a third physician. The primary outcome was total rubric score. Secondary outcomes included differential diagnosis quality, justification, appropriateness of additional examinations, final diagnosis, and triage accuracy. Exploratory analyses assessed the number and cost of recommended examinations.

Results: The effect of language differed significantly by model. Qwen2.5, Llama3.3, Meditron3, and OpenBioLLM performed significantly better in English, with the largest gap observed for Qwen2.5. GPT-5 and Bielik showed no statistically detectable English-Polish difference in overall score in this sample. Language-related differences were most evident in differential diagnosis quality, justification, and examination planning rather than in final diagnosis alone. Exploratory economic analyses suggested model- and language-dependent differences in testing burden, with broader suggested workups generally associated with higher diagnostic costs. Language robustness was not a consistent property of clinically evaluated LLMs. Performance differences were concentrated in reasoning and workup domains that are safety-relevant if these systems are used clinically.

Conclusions: Multilingual clinical performance of LLMs is strongly model dependent. Language-specific evaluation should be considered before deployment in non-English healthcare systems.
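
The sample sizes reported in the Methods can be cross-checked with a short sketch. It assumes a fully crossed design (every model answers every vignette in both languages) and that each response is scored once by each of the 2 independent raters; the abstract does not state this explicitly, and the variable names are illustrative only. Third-physician adjudications are not counted here.

```python
from itertools import product

# Model names as listed in the abstract.
MODELS = ["Qwen2.5", "Llama3.3", "Meditron3", "OpenBioLLM", "GPT-5", "Bielik"]
LANGUAGES = ["English", "Polish"]
N_VIGNETTES = 30
N_RATERS = 2  # independent physician raters (assumed to score every response)

# Assumption: one response per model-language-vignette combination.
responses = list(product(MODELS, LANGUAGES, range(1, N_VIGNETTES + 1)))

# Assumption: each response yields one rater-level evaluation per rater.
evaluations = [
    (model, language, vignette, rater)
    for (model, language, vignette) in responses
    for rater in range(1, N_RATERS + 1)
]

print(len(responses))    # 360 unique model-language-vignette responses
print(len(evaluations))  # 720 rater-level evaluations
```

Under these assumptions the counts match the 360 responses and 720 rater-level evaluations reported in the abstract.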

Authors

Jakub Magdziarz Ibrahim-El-Nur

Medical University of Warsaw

Wojciech Kaczmarek

Weronika Winiarska

Journals

Journal of Clinical Medicine

Institutions

Medical University of Warsaw

Warsaw University of Technology

AGH University of Krakow
