What question did this study set out to answer?

The study evaluates and compares the performance of large-scale and lightweight large language models (LLMs) on traditional Chinese medicine (TCM) exam questions.

February 14, 2026

Evaluating Large‐Scale and Lightweight Large Language Models for Traditional Chinese Medicine Exam Questions: A Comparative Study

Key Points

Aimed to evaluate and compare large-scale and lightweight large language models (LLMs) on traditional Chinese medicine (TCM) exam questions.
Developed a dataset of 801 questions from TCM textbooks.
Evaluated 11 LLMs under zero-shot and few-shot prompting conditions in both English and Chinese.
Measured performance primarily by accuracy in answering questions.
Large-scale LLMs achieved high accuracy on single-choice questions (69.01%–90.92%) but struggled with multiple-choice questions (max accuracy 8.40%).
Lightweight LLMs generally performed worse than large-scale LLMs, with accuracies ranging from 2.10% to 49.48%.
Few-shot prompting improved performance in 8 of 11 models (72.7%), and Chinese prompts were more effective than English prompts in 9 of 11 models (81.8%).

Abstract

Background: Traditional Chinese medicine (TCM), with its knowledge‐intensive framework, poses unique challenges for large language models (LLMs). Although TCM‐specific benchmarks and models have been developed, the performance of lightweight LLMs remains insufficiently investigated. This study presents a systematic evaluation and comparison of large‐scale and lightweight LLMs to assess their capabilities and deployment trade‐offs.

Methods: We developed a TCM‐related question‐answering dataset comprising 801 questions derived from TCM textbooks. Eleven LLMs were evaluated under zero‐shot and few‐shot prompting conditions in both English and Chinese. Performance was primarily measured by accuracy.

Results: Large‐scale LLMs achieved high accuracy on single‐choice questions (69.01%–90.92%) and moderate accuracy on true/false questions (52.34%–59.38%) but performed poorly on multiple‐choice questions, with a maximum accuracy of only 8.40%. Lightweight LLMs (2.10%–49.48%) generally lagged behind larger LLMs (6.30%–95.07%). However, Qwen3‐1.7B (5.92%–54.20%) stood out and even surpassed the domain‐specialized TCMChat‐7B (2.10%–36.98%). Few‐shot prompting enhanced performance in 8/11 (72.7%) of the models, and Chinese prompts yielded better results than English prompts in 9/11 (81.8%) of the models. Symptomatic diagnosis emerged as the most challenging reasoning category across all models (16.75%–48.07%).

Conclusion: This study demonstrates that although large‐scale LLMs exhibit strong knowledge recall in TCM, their suboptimal performance on multiple‐choice questions and substantial computational costs may limit their practical applicability in clinical settings. The robust performance of Qwen3‐1.7B indicates that effective model optimization and domain‐specific training may offer greater advantages than simply increasing model size. Although the current evaluation is based on examination‐style tasks and does not involve real‐world clinical decision‐making, the findings provide insights to support the deployment of optimized models in resource‐constrained healthcare environments.
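
The abstract names accuracy as the primary metric but does not spell out the scoring rule. The following is a minimal sketch, assuming exact-match scoring, in which a multiple-choice answer counts as correct only if every selected option matches the answer key. That assumption is not confirmed by the source. The function names `accuracy` and `share_of_models` are hypothetical, and the toy answers are illustrative, not drawn from the study's dataset. The second helper simply reproduces the reported model shares (8/11 and 9/11).

```python
def accuracy(predictions, answers):
    """Return the percentage of questions answered correctly.

    Assumption (not from the source): exact-match scoring. Each answer is a
    string of option letters, e.g. "A" or "BCD"; order is ignored, but a
    partially correct multiple-choice answer earns no credit.
    """
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(set(p) == set(a) for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def share_of_models(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Toy example: the second answer misses option "C", so it scores zero.
toy_predictions = ["A", "BD", "CA"]
toy_answers = ["A", "BCD", "AC"]
toy_accuracy = accuracy(toy_predictions, toy_answers)

# Shares reported in the abstract: few-shot gains and Chinese-prompt gains.
few_shot_share = share_of_models(8, 11)
chinese_prompt_share = share_of_models(9, 11)
```

Under a rule like this, one missed or extra option zeroes out a multiple-choice item, which would be consistent with multiple-choice accuracy falling far below single-choice accuracy, though the source does not state that this is the cause.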
