July 28, 2025

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: A Mixed Method Approach (Preprint)

Key Points

Large language models exhibit variable performance across three languages in Swiss medical multiple-choice questions.
Quantitative and qualitative analyses show that most models answer most questions correctly, but accuracy differs across models.
All models showed reasoning errors in the qualitative analysis and sometimes failed to identify the best answer even when they were factually accurate on the topic.
The study emphasizes the need for ongoing evaluation and transparent reporting to integrate AI effectively in medical education.

Abstract

BACKGROUND Artificial intelligence continues to transform healthcare, offering promising applications in clinical practice and medical education. While large language models, a form of generative artificial intelligence, have shown potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex, interdependent influence of factors such as language and model type on the accuracy of large language models, yet cross-language comparisons remain underexplored. Switzerland's multilingual medical licensing exam provides a unique opportunity to investigate these dynamics.
OBJECTIVE This study evaluates the performance of large language models on Swiss medical multiple-choice questions across three languages, aiming to uncover model capabilities in a multilingual medical education context.
METHODS For this study, 150 publicly accessible multilingual multiple-choice questions from an online self-assessment tool were selected and analysed. A mixed-method approach combining quantitative and qualitative methods was used to evaluate the outputs of the large language models. Several large language models developed by OpenAI, MetaAI, Anthropic, MistralAI, and DeepSeek were evaluated by prompting them to answer these questions in a text-only format.
RESULTS The performance of large language models on medical questions varied by model and language. While most models answered most multiple-choice questions correctly, accuracy differed across models. In the qualitative analysis, all models showed reasoning errors and sometimes struggled to identify the most correct answer, despite demonstrating factual accuracy on the topic in question.
CONCLUSIONS While our results are in line with previous demonstrations of the high potential of large language models in answering multilingual medical exam questions, this study highlights the importance of careful model selection, prompt design, and awareness of performance variability across languages. There is a need for ongoing evaluation as well as transparent reporting to ensure reliable integration of large language models into medical education contexts.
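
The quantitative part of such an evaluation amounts to scoring each model's chosen option against the answer key and grouping the results by model and language. The following is a minimal sketch of that tabulation, not the authors' actual pipeline: the record layout, the function name `accuracy_by_model_and_language`, the model names, and the language codes are all hypothetical placeholders, and the toy records are invented purely for illustration.

```python
from collections import defaultdict


def accuracy_by_model_and_language(records):
    """Return accuracy per (model, language) pair.

    `records` is an iterable of (model, language, predicted, correct)
    tuples; this layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, language, predicted, correct in records:
        key = (model, language)
        totals[key] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == correct.strip().upper():
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical toy data: model names, language codes, and answers are invented.
records = [
    ("model_a", "de", "A", "A"),
    ("model_a", "de", "B", "C"),
    ("model_a", "fr", "D", "D"),
    ("model_b", "it", " e", "E"),
]

scores = accuracy_by_model_and_language(records)
for (model, language), accuracy in sorted(scores.items()):
    print(f"{model} [{language}]: {accuracy:.2f}")
```

Keeping the language as part of the grouping key makes cross-language differences for the same model directly visible, which is the comparison the study is concerned with.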
