Background: Large language models (LLMs) are increasingly being evaluated for clinically relevant diagnostic tasks, yet their performance may vary across languages. We aimed to determine whether input language influences LLM diagnostic reasoning in vignette-based clinical tasks and to inform multilingual predeployment evaluation for non-English healthcare systems.

Methods: In this translation-controlled in silico study, 30 real-patient clinical vignettes were presented in paired English- and Polish-language conditions using back-translated prompts and cases. Six LLMs were evaluated with a structured reflection framework adapted from medical education. The study included 720 rater-level evaluations and 360 unique model–language–vignette responses. Responses were independently scored by 2 physician raters, with major discrepancies adjudicated by a third physician. The primary outcome was total rubric score. Secondary outcomes included differential diagnosis quality, justification, appropriateness of additional examinations, final diagnosis, and triage accuracy. Exploratory analyses assessed the number and cost of recommended examinations.

Results: The effect of language differed significantly by model. Qwen2.5, Llama3.3, Meditron3, and OpenBioLLM performed significantly better in English, with the largest gap observed for Qwen2.5. GPT-5 and Bielik showed no statistically detectable English–Polish difference in overall score in this sample. Language-related differences were most evident in differential diagnosis quality, justification, and examination planning rather than in final diagnosis alone. Exploratory economic analyses suggested model- and language-dependent differences in testing burden, with broader suggested workups generally associated with higher diagnostic costs. Language robustness was not a consistent property of clinically evaluated LLMs. Performance differences were concentrated in reasoning and workup domains that are safety-relevant if these systems are used clinically.

Conclusions: Multilingual clinical performance of LLMs is strongly model dependent. Language-specific evaluation should be considered before deployment in non-English healthcare systems.
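
The sample sizes reported in the Methods can be checked with a short sketch. It assumes a fully crossed design (every model answers every vignette in both languages) and that the 720 rater-level evaluations count only the 2 primary raters, not the adjudicator; neither assumption is stated explicitly in the abstract, and the variable names are illustrative.

```python
from itertools import product

# Model names as listed in the Results; the design layout is an assumption.
MODELS = ["Qwen2.5", "Llama3.3", "Meditron3", "OpenBioLLM", "GPT-5", "Bielik"]
LANGUAGES = ["English", "Polish"]
N_VIGNETTES = 30
N_RATERS = 2  # assumed: primary physician raters only, adjudicator excluded

# One response per (model, language, vignette) cell of the crossed design.
responses = list(product(MODELS, LANGUAGES, range(1, N_VIGNETTES + 1)))
n_responses = len(responses)

# Each response is scored independently by every primary rater.
n_evaluations = n_responses * N_RATERS

print(n_responses, n_evaluations)
```

Under these assumptions the counts match the 360 unique responses and 720 rater-level evaluations given in the abstract.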
Jakub Magdziarz Ibrahim-El-Nur
Medical University of Warsaw
Wojciech Kaczmarek
Weronika Winiarska
Journal of Clinical Medicine
Medical University of Warsaw
Warsaw University of Technology
AGH University of Krakow
Ibrahim-El-Nur et al. studied this question.
synapsesocial.com/papers/6a168a7f0c924ddd1bd59248 — DOI: https://doi.org/10.3390/jcm15114082