Geographical authentication of medicinal plants remains challenging because of substantial chemical variability and the limited interpretability of conventional classification models. Paeoniae Radix Rubra (PRR), an important industrial medicinal crop, exhibits pronounced regional heterogeneity in chemical composition. This variability complicates reliable origin discrimination and quality assessment. To address this limitation, an analytical framework combining UHPLC-Q-Orbitrap HRMS-based untargeted metabolomics with interpretable machine learning was developed. Metabolic profiles were obtained from 45 PRR samples collected across three major producing regions in China. A Random Forest classifier was constructed for geographical discrimination. Rigorous data partitioning and five-fold cross-validation were used to limit overfitting. SHapley Additive exPlanations (SHAP) analysis was applied to identify discriminative metabolites and to estimate their contributions to classification performance. The model achieved an accuracy of 80.0% and an area under the curve (AUC) of 0.947 on the independent test set. Taxifolin, 2-anisic acid, luteolin, apocynin, and 1,2,3,4,6-pentagalloylglucose were identified as candidate regional markers. Restricting feature selection to the training set minimized the risk of data leakage and improved the reliability of performance estimates. This leakage-controlled and interpretable workflow establishes a transparent chemical basis for PRR origin authentication and provides a transferable strategy for other medicinal crops requiring reliable geographical traceability.

• Regional metabolite profiles of Paeoniae Radix Rubra revealed significant divergence.
• The Random Forest model demonstrated excellent performance in classifying geographical origins.
• SHAP analysis interpreted the model's predictions on both global and local scales.
• Model interpretability enhanced the credibility of the origin-tracing results.
Fan et al. studied this question.
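The leakage control described in the abstract, where feature selection sees only training data, can be illustrated with a minimal standard-library sketch. Everything here is an assumption made for illustration and is not the authors' pipeline: the data are synthetic (3 classes of 15 samples, mirroring only the 45-sample design), features are ranked by a one-way ANOVA F statistic (the abstract does not name the selection method), and a nearest-centroid classifier stands in for the Random Forest to keep the example dependency-free. SHAP analysis and the independent test set are omitted.

```python
import random
from statistics import mean


def make_data(n_per_class=15, n_features=60, n_informative=5, seed=0):
    """Synthetic stand-in for a metabolite table: 3 regions x 15 samples."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            # Only the first n_informative features carry a class-dependent shift.
            X.append([rng.gauss(3.0 * label if j < n_informative else 0.0, 1.0)
                      for j in range(n_features)])
            y.append(label)
    return X, y


def f_scores(X, y):
    """One-way ANOVA F statistic per feature, computed on the rows given."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        grand = mean(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, lab in zip(col, y) if lab == c]
            m = mean(vals)
            between += len(vals) * (m - grand) ** 2
            within += sum((v - m) ** 2 for v in vals)
        df_b, df_w = len(classes) - 1, len(X) - len(classes)
        scores.append((between / df_b) / (within / df_w + 1e-12))
    return scores


def select_top_k(X_train, y_train, k):
    """Indices of the k highest-scoring features, ranked on training rows only."""
    scores = f_scores(X_train, y_train)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, cols):
    """Per-class mean vector over the selected feature columns."""
    return {c: [mean(r[j] for r, lab in zip(X, y) if lab == c) for j in cols]
            for c in sorted(set(y))}


def predict(centroids, row, cols):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((row[j] - m) ** 2
                                            for j, m in zip(cols, centroids[c])))


def stratified_folds(y, n_folds, seed=0):
    """Deal each class's shuffled indices round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for c in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == c]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def cross_validate(X, y, k=5, n_folds=5):
    """Five-fold CV in which feature selection is refit inside every fold."""
    accuracies = []
    for held_out in stratified_folds(y, n_folds):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        cols = select_top_k(X_tr, y_tr, k)  # held-out rows never influence selection
        centroids = fit_centroids(X_tr, y_tr, cols)
        hits = sum(predict(centroids, X[i], cols) == y[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return accuracies


X, y = make_data()
fold_acc = cross_validate(X, y)
print("fold accuracies:", fold_acc)
print("mean accuracy:", mean(fold_acc))
```

The design point is the position of `select_top_k`: it is called inside the fold loop on the training rows alone. Ranking features once on all 45 samples before splitting would let held-out samples influence which features are kept, which tends to inflate the cross-validated accuracy.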