BACKGROUND: Large language models (LLMs) show promise in specialized domains, but their capabilities in Traditional Chinese Medicine (TCM), a field with complex theoretical foundations and clinical practices, remain inadequately assessed. This study aims to develop and apply a comprehensive, multidimensional benchmark to systematically evaluate LLM proficiency in this culturally grounded medical discipline.

METHODS: We constructed the TCM-3CEval benchmark to assess models across three core dimensions: mastery of fundamental knowledge, comprehension of classical texts, and clinical decision-making. The dataset comprises 450 expert-validated single-choice questions. We evaluated a diverse set of models, including general-purpose international models, Chinese general-purpose models, and medical domain-specific models. Model performance was measured as accuracy under a rigorous permutation-based consistency test that controls for positional bias.

RESULTS: Top-tier models approach human-level performance in aggregate scores, but distinct deficits remain in clinical reliability. Models trained with Chinese linguistic and cultural priors interpret classical texts better than their international counterparts. Although the models' overall accuracy is statistically comparable to that of human students (p > 0.05), they exhibit pronounced deficiencies in critical subdomains such as TCM Diagnostics and Syndrome Differentiation, where human reasoning remains significantly more robust. The benchmark effectively discriminates between rote memorization and the holistic inference required for clinical practice.

CONCLUSIONS: This study establishes a standardized, triaxial evaluation paradigm for assessing AI in Traditional Chinese Medicine. The findings underscore the importance of cultural-contextual alignment and domain-specific adaptation for developing capable medical LLMs. The benchmark provides a foundation for optimizing models in culturally grounded medical domains and supports the responsible integration of AI into traditional medical education and practice.
Huang et al. studied this question.
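The abstract does not spell out how the permutation-based consistency test is scored, so the following is only a minimal sketch of one plausible reading: a question counts as correct only if the model selects the right option under every reordering of the answer choices. The names `consistent_accuracy`, `always_first`, `content_oracle`, and the placeholder questions are hypothetical and are not taken from TCM-3CEval.

```python
from itertools import islice, permutations
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical data model: (stem, options, index of the correct option).
Question = Tuple[str, Sequence[str], int]
# Hypothetical model interface: returns the POSITION of the chosen option.
AnswerFn = Callable[[str, Sequence[str]], int]


def consistent_accuracy(
    questions: Sequence[Question],
    answer_fn: AnswerFn,
    max_perms: Optional[int] = None,
) -> float:
    """Fraction of questions answered correctly under every tested option order.

    Assumption: "consistency" means correct under all permutations; the
    benchmark itself may use a different rule or a subset of permutations.
    """
    if not questions:
        return 0.0
    n_consistent = 0
    for stem, options, correct in questions:
        # Enumerate option orders (optionally capped at max_perms).
        orders = islice(permutations(range(len(options))), max_perms)
        # Map the chosen position back to the original option index.
        ok = all(
            order[answer_fn(stem, [options[i] for i in order])] == correct
            for order in orders
        )
        n_consistent += ok
    return n_consistent / len(questions)


# Placeholder questions with no real TCM content.
QUESTIONS = [
    ("Q1", ["opt-a", "opt-b", "opt-c", "opt-d"], 0),
    ("Q2", ["opt-e", "opt-f", "opt-g", "opt-h"], 2),
]
KEY = {stem: options[correct] for stem, options, correct in QUESTIONS}


def always_first(stem: str, options: Sequence[str]) -> int:
    """Toy model with pure positional bias: always picks the first option."""
    return 0


def content_oracle(stem: str, options: Sequence[str]) -> int:
    """Toy model that identifies the answer by content, not position."""
    return list(options).index(KEY[stem])
```

With a single fixed option order (`max_perms=1`), the position-biased toy model still gets credit for Q1 because the correct option happens to come first; once all orderings are tested, that credit disappears while the content-based model is unaffected. This is the kind of positional bias the METHODS section says the test is meant to control for.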