What question did this study set out to answer?

To create a benchmark dataset for assessing large language models in the context of the Chinese National Medical Licensing Examination.

April 19, 2026 · Open Access

A Dataset for Evaluating Large Language Models on Chinese National Medical Licensing Examinations

Key Points

Aimed to create a benchmark dataset for assessing large language models on the Chinese National Medical Licensing Examination.
Developed the CNMLEQA dataset, comprising two subsets: CNMLEQA-10k (9,890 questions) and CNMLEQA-3k (2,949 questions).
Integrated question-answer pairs from three sources: PubMed, GitHub, and MedExamLLM.
Annotated questions based on type and clinical scenarios by clinical experts.
Evaluated state-of-the-art LLMs including Gemini, DeepSeek, GPT, Qwen, and LLaMA.
Qwen2.5-32B achieved an accuracy of 90.88% on CNMLEQA-10k.
DeepSeek-R1 reached an accuracy of 91.59% on CNMLEQA-3k.
Fine-tuning experiments on Qwen models showed significant performance improvements.

Abstract

Large language models (LLMs) are increasingly applied in medical education, question answering, and clinical reasoning, yet standardized datasets in non-English contexts remain limited. To address this gap, we present CNMLEQA, a benchmark dataset for evaluating LLMs on the Chinese National Medical Licensing Examination. The dataset integrates question-answer pairs from three sources: PubMed, GitHub, and MedExamLLM. CNMLEQA comprises two subsets, CNMLEQA-10k (9,890 questions) and CNMLEQA-3k (2,949 questions), each consisting of multiple-choice questions with five options and one correct answer. Questions are annotated along three key dimensions: (1) question type (knowledge-based or case-based); (2) auxiliary metadata such as examination year; and (3) clinical scenario information across five categories: disease or diagnosis, surgery, medication, laboratory examination, and symptom or sign. Annotation was conducted by clinical experts. To validate the dataset, we evaluated state-of-the-art LLMs including Gemini, DeepSeek, GPT, Qwen, and LLaMA, and conducted fine-tuning experiments on Qwen models. Results show that Qwen2.5-32B achieved an accuracy of 90.88% on CNMLEQA-10k, while DeepSeek-R1 achieved an accuracy of 91.59% on CNMLEQA-3k. The fine-tuning experiments further demonstrated significant performance improvements. CNMLEQA provides a multidimensional, clinically grounded benchmark for advancing LLM evaluation in Chinese medical applications.
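To make the record layout described in the abstract concrete, the following is a minimal sketch of how one question and an accuracy calculation might be represented. It is not the released CNMLEQA schema or the authors' evaluation code: the class name `Question`, every field name, the scenario keys, and the function `accuracy_by_type` are hypothetical and chosen only for illustration. The only details taken from the abstract are the five options with one correct answer, the two question types, the examination year, and the five clinical scenario categories.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Five options per question, as stated in the abstract. The labels are assumed.
OPTION_LABELS = ("A", "B", "C", "D", "E")

# Hypothetical keys for the five clinical scenario categories.
SCENARIO_KEYS = ("disease_or_diagnosis", "surgery", "medication",
                 "laboratory_examination", "symptom_or_sign")


@dataclass
class Question:
    stem: str
    options: Dict[str, str]           # option label -> option text
    answer: str                       # label of the single correct option
    question_type: str                # "knowledge" or "case" (assumed spellings)
    exam_year: Optional[int] = None   # auxiliary metadata
    scenario: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Enforce the five-option, one-answer structure.
        if tuple(sorted(self.options)) != OPTION_LABELS:
            raise ValueError("expected exactly the options A-E")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the option labels")
        unknown = set(self.scenario) - set(SCENARIO_KEYS)
        if unknown:
            raise ValueError(f"unknown scenario keys: {sorted(unknown)}")


def accuracy_by_type(questions: List[Question],
                     predictions: List[str]) -> Dict[str, float]:
    """Return overall accuracy plus accuracy per question type."""
    if len(questions) != len(predictions):
        raise ValueError("one prediction is required per question")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        for key in ("overall", question.question_type):
            total[key] += 1
            correct[key] += int(predicted == question.answer)
    return {key: correct[key] / total[key] for key in total}
```

Under these assumptions, a model's predicted option labels would be passed alongside the question list, and the returned dictionary would hold an `"overall"` entry comparable in kind to the reported accuracies, with separate entries for knowledge-based and case-based questions.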
