BACKGROUND Artificial intelligence continues to transform healthcare, offering promising applications in clinical practice and medical education. Although large language models, a form of generative artificial intelligence, have shown the potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex, interdependent influence of factors such as language and model type on the accuracy of large language models, yet cross-language comparisons remain underexplored. Switzerland’s multilingual medical licensing exam provides a unique opportunity to investigate these dynamics.
OBJECTIVE This study evaluates the performance of large language models on Swiss medical multiple-choice questions across three languages, aiming to characterise model capabilities in a multilingual medical education context.
METHODS For this study, 150 publicly accessible multilingual multiple-choice questions from an online self-assessment tool were selected and analysed. A mixed-methods approach combining quantitative and qualitative analysis was used to evaluate the outputs of the large language models. Several large language models developed by OpenAI, MetaAI, Anthropic, MistralAI, and DeepSeek were evaluated by prompting them to answer these questions in a text-only format.
RESULTS The performance of large language models on medical questions varied by model and language. While most models answered most multiple-choice questions correctly, accuracy differed across models. In the qualitative analysis, all models showed reasoning errors and sometimes struggled to identify the most correct answer, even when they demonstrated factual accuracy on the topic in question.
CONCLUSIONS While our results are in line with previous demonstrations of the high potential of large language models in answering multilingual medical exam questions, this study highlights the importance of careful model selection, prompt design, and awareness of performance variability across languages. There is a need for ongoing evaluation as well as transparent reporting to ensure reliable integration of large language models into medical education contexts.
Strasser et al. studied this question.