What question did this study set out to answer?

The research aims to create a benchmark to assess large language model proficiency in traditional Chinese medicine (TCM).

May 15, 2026 · Open Access

A triaxial benchmark for assessing responses from large language models in traditional Chinese medicine

Key Points

Developed the TCM-3CEval benchmark, which assesses three dimensions: fundamental knowledge, classical texts, and clinical decision-making.
Constructed a dataset of 450 expert-validated questions for evaluation.
Measured model performance using accuracy and permutation-based consistency tests.
Top-performing models reach overall accuracy comparable to that of human students (p > 0.05) but show deficits in clinical reliability.
Chinese-centric models outperform international models in interpreting classical texts.
Models show pronounced deficiencies in subdomains such as TCM Diagnostics and Syndrome Differentiation, where human reasoning remains significantly more robust.

Abstract

BACKGROUND: Large language models (LLMs) show promise in specialized domains, but their capabilities in Traditional Chinese Medicine (TCM), a field with complex theoretical foundations and clinical practices, remain inadequately assessed. This study aims to develop and apply a comprehensive, multidimensional benchmark to systematically evaluate LLM proficiency in this culturally grounded medical discipline.

METHODS: We constructed the TCM-3CEval benchmark to assess models across three core dimensions: mastery of fundamental knowledge, comprehension of classical texts, and clinical decision-making. The dataset comprises 450 expert-validated single-choice questions. We evaluated a diverse set of models, including general-purpose international models, Chinese general models, and medical domain-specific models. Model performance was measured by accuracy under a rigorous permutation-based consistency test to control for positional bias.

RESULTS: We show that while top-tier models approach human-level performance in aggregate scores, distinct deficits remain in clinical reliability. Models trained with Chinese linguistic and cultural priors demonstrate superior capability in interpreting classical texts compared to international counterparts. However, despite achieving overall accuracy comparable to that of human students (p > 0.05), models exhibit pronounced deficiencies in critical subdomains such as TCM Diagnostics and Syndrome Differentiation, where human reasoning remains significantly more robust. The benchmark effectively discriminates between rote memorization and the holistic inference required for clinical practice.

CONCLUSIONS: This study establishes a standardized, triaxial evaluation paradigm for assessing AI in Traditional Chinese Medicine. The findings underscore the importance of cultural-contextual alignment and domain-specific adaptation for developing capable medical LLMs. This benchmark provides a foundation for optimizing models in culturally grounded medical domains and supports the responsible integration of AI into traditional medical education and practice.
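The abstract does not spell out how the permutation-based consistency test is scored, so the following is only a minimal sketch of one common reading: a single-choice question counts as correct only if the model selects the right option under every tested ordering of the options. The `ask` callable, the question dictionary layout, the `max_perms` cap, and the toy data are all hypothetical stand-ins and are not taken from the paper.

```python
from itertools import permutations
from typing import Callable, Optional, Sequence

# Hypothetical model interface (an assumption, not the paper's API):
# given a question stem and an ordered list of option texts,
# return the index of the option the model selects.
AskFn = Callable[[str, Sequence[str]], int]


def consistent_correct(stem: str, options: Sequence[str], answer_idx: int,
                       ask: AskFn, max_perms: Optional[int] = None) -> bool:
    """True only if the model picks the correct option under every tested ordering."""
    for n, order in enumerate(permutations(range(len(options)))):
        if max_perms is not None and n >= max_perms:
            break
        shuffled = [options[i] for i in order]
        choice = ask(stem, shuffled)
        # order[choice] maps the chosen position back to the original option index.
        if order[choice] != answer_idx:
            return False
    return True


def consistency_accuracy(questions: Sequence[dict], ask: AskFn,
                         max_perms: Optional[int] = None) -> float:
    """Fraction of questions answered correctly under all tested orderings."""
    if not questions:
        return 0.0
    hits = sum(
        consistent_correct(q["stem"], q["options"], q["answer"], ask, max_perms)
        for q in questions
    )
    return hits / len(questions)


# Toy placeholder data (not real benchmark items).
toy_questions = [
    {"stem": "Q1", "options": ["a1", "b1", "c1", "d1"], "answer": 0},
    {"stem": "Q2", "options": ["a2", "b2", "c2", "d2"], "answer": 2},
]
_key = {q["stem"]: q["options"][q["answer"]] for q in toy_questions}


def oracle(stem: str, opts: Sequence[str]) -> int:
    """Stand-in model that always finds the correct option text."""
    return list(opts).index(_key[stem])


def always_first(stem: str, opts: Sequence[str]) -> int:
    """Stand-in model with pure positional bias: always picks the first option."""
    return 0
```

Under this scoring, a position-biased responder such as `always_first` can look partly correct when only the original ordering is tested (`max_perms=1`), but it scores zero once all orderings are required to agree, which is the kind of positional bias the abstract says the test is meant to control for.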
