ABSTRACT

Background
Traditional Chinese medicine (TCM), with its knowledge‐intensive framework, poses unique challenges for large language models (LLMs). Although TCM‐specific benchmarks and models have been developed, the performance of lightweight LLMs remains insufficiently investigated. This study presents a systematic evaluation and comparison of large‐scale and lightweight LLMs to assess their capabilities and deployment trade‐offs.

Methods
We developed TCM‐related question‐answering, a dataset comprising 801 questions derived from TCM textbooks. Eleven LLMs were evaluated under zero‐shot and few‐shot prompting conditions in both English and Chinese. Performance was primarily measured by accuracy.

Results
Large‐scale LLMs achieved high accuracy on single‐choice (69.01%–90.92%) and true/false questions (52.34%–59.38%) but performed poorly on multiple‐choice questions, with a maximum accuracy of only 8.40%. Lightweight LLMs (2.10%–49.48%) generally lagged behind larger LLMs (6.30%–95.07%). However, Qwen3‐1.7B (5.92%–54.20%) stood out and even surpassed the domain‐specialized TCMChat‐7B (2.10%–36.98%). Few‐shot prompting improved performance in 8 of the 11 models (72.7%), and Chinese prompts yielded better results than English prompts in 9 of the 11 models (81.8%). Symptomatic diagnosis emerged as the most challenging reasoning category across all models (16.75%–48.07%).

Conclusion
This study demonstrates that although large‐scale LLMs exhibit strong knowledge recall in TCM, their suboptimal performance on multiple‐choice questions and substantial computational costs may limit their practical applicability in clinical settings. The robust performance of Qwen3‐1.7B indicates that effective model optimization and domain‐specific training may offer greater advantages than simply increasing model size. Although the current evaluation is based on examination‐style tasks and does not involve real‐world clinical decision‐making, our findings provide insights to support the deployment of optimized models in resource‐constrained healthcare environments.
Li et al. (Wed,) studied this question.