OBJECTIVE: To systematically evaluate the diagnostic performance of artificial intelligence (AI) models for hepatocellular carcinoma (HCC) and to pool sensitivity and specificity using a diagnostic meta-analysis, with further assessment of the robustness of findings across different validation levels.

METHODS: Original studies on AI-based diagnosis of HCC were systematically searched. Literature screening, data extraction, and quality assessment were independently performed by two reviewers. Extracted data included basic study characteristics, model type, validation level, and diagnostic performance metrics, including sensitivity, specificity, area under the curve (AUC), and extractable true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN) values. If a single study reported results from different validation levels or analytical scenarios, these were extracted as separate study records for quantitative synthesis. Methodological quality was assessed using the QUADAS-2 tool. Formal validation sets (validation, independent validation, or external validation) were included in the primary analysis, whereas development sets and internal validation sets were included in the sensitivity analysis. Because only one eligible external validation study was available for early HCC, this outcome was presented descriptively only. For studies with extractable 2 × 2 table data, pooled sensitivity and specificity were estimated using the Reitsma bivariate random-effects model, and a summary receiver operating characteristic (SROC) curve was constructed. An additional sensitivity analysis retaining a single highest-validation-level record per study was performed to assess the influence of potential non-independence among records from the same study.

RESULTS: A total of 11 studies were included in the qualitative analysis. Because some studies reported results from different validation levels or analytical scenarios, separate study records were extracted for quantitative synthesis. Of these, four formal validation records were included in the primary analysis, seven records were included in the sensitivity analysis, and one early HCC record was included for descriptive presentation. The primary analysis showed that the pooled sensitivity and specificity of AI models for HCC diagnosis were 0.904 (95% CI: 0.845-0.942) and 0.971 (95% CI: 0.891-0.993), respectively. In the sensitivity analysis, after inclusion of development and internal validation datasets, the pooled sensitivity was 0.895 (95% CI: 0.831-0.936) and the pooled specificity was 0.935 (95% CI: 0.846-0.974). Overall, the currently available records reporting AUC values suggested that AI models had good discriminative ability for HCC diagnosis. In the early HCC analysis, a single external validation study reported a sensitivity of 0.88 and a specificity of 1.00; however, no pooled analysis was performed because of the limited evidence. In the additional study-level sensitivity analysis, the pooled sensitivity and specificity were 0.889 and 0.964, respectively, which were close to the record-level primary estimates. Exploratory assessment according to QUADAS-2 domains suggested that patient selection bias, particularly selected case-control comparisons, may have contributed to the high pooled specificity.

CONCLUSION: Current evidence suggests that AI models may have promising potential as auxiliary diagnostic tools for HCC. However, because the primary pooled estimates were based on only four formal validation records with heterogeneous data modalities, reference standards, study populations, and validation designs, these results should be interpreted as preliminary and hypothesis-generating rather than definitive evidence of clinical readiness. Further high-quality, multicenter, prospective external validation studies with complete patient-level 2 × 2 data and clinically relevant comparator populations are needed to clarify the real-world clinical utility of AI models for HCC diagnosis, particularly for early HCC.
Ren et al. (Mon,) studied this question.
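
As an illustration of the 2 × 2 data extraction described in the methods, the following minimal Python sketch shows how per-record sensitivity and specificity, together with their logit transforms and approximate variances, could be derived from TP, FP, FN, and TN counts. Logit pairs of this kind are the usual per-record inputs to a bivariate random-effects model; the sketch does not fit the Reitsma model itself. The function names, the 0.5 continuity correction, and the example counts are assumptions made for illustration and are not taken from the study.

```python
import math


def logit_accuracy(tp, fp, fn, tn, correction=0.5):
    """Per-record accuracy measures from a 2 x 2 table (illustrative helper).

    Returns sensitivity, specificity, their logits, and the approximate
    variances of the logits.
    """
    # Assumption: add a continuity correction to every cell only when
    # at least one cell is zero, so that the logits stay finite.
    if min(tp, fp, fn, tn) == 0:
        tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "logit_sens": math.log(tp / fn),
        "logit_spec": math.log(tn / fp),
        "var_logit_sens": 1 / tp + 1 / fn,
        "var_logit_spec": 1 / tn + 1 / fp,
    }


def ci_from_logit(logit, var, z=1.96):
    """Back-transform a Wald interval on the logit scale to a proportion."""
    half_width = z * math.sqrt(var)
    lower = 1 / (1 + math.exp(-(logit - half_width)))
    upper = 1 / (1 + math.exp(-(logit + half_width)))
    return lower, upper


# Hypothetical counts, not data from any included study.
record = logit_accuracy(tp=90, fp=5, fn=10, tn=95)
sens_ci = ci_from_logit(record["logit_sens"], record["var_logit_sens"])
print(record["sensitivity"], sens_ci)
```

Working on the logit scale keeps the back-transformed interval inside the range 0 to 1, which is one reason bivariate models are typically specified on logit-transformed sensitivity and specificity.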