We thank Lei et al. 1 for their interest in and careful review of our meta-analysis on AI-assisted capsule endoscopy (CE) 2. We agree with the concerns raised and appreciate the opportunity to correct and clarify our reporting.

First, we acknowledge an error in the presentation of Figure 4 (pooled specificity for AI-assisted CE) 2. Specifically, the specificity for Afonso et al. 3 was displayed on an inconsistent scale (plotted near zero), which in turn distorted the pooled specificity estimate. We have corrected the plot (Figure 1) and rerun the meta-analysis. After correction, the pooled specificity for AI-assisted CE is 91.15% (95% CI 89.55%–92.75%). Because the incorrect pooled specificity influenced interpretive statements, we also consider that the abstract and discussion need to be amended accordingly. We emphasize that AI-assisted CE shows higher pooled sensitivity and negative predictive value (NPV) than conventional CE, with high specificity, supporting its potential role as an adjunct to clinician interpretation for small-bowel lesion detection. Conventional CE, however, showed higher pooled diagnostic accuracy and positive predictive value (PPV). The significant heterogeneity across AI systems, thresholds, and lesion targets limits the interpretability of a single summary estimate. Larger prospective studies and standardized diagnostic-accuracy syntheses using hierarchical models are therefore needed to define which AI approaches provide consistent, generalizable benefit in routine practice.

Second, we agree that the clinical implications of false positives (FPs) and false negatives (FNs) in capsule endoscopy are not equivalent: FNs risk missed pathology, whereas many FPs can be dismissed at clinician over-read. We also agree that FP burden should be judged not only clinically but also operationally (e.g., number of flagged frames, secondary-review time, and workflow efficiency). However, emerging real-world small-bowel capsule endoscopy (SBCE) data suggest that well-designed AI-assisted reading can substantially reduce overall review time while maintaining or improving clinically relevant detection. This indicates that sensitivity gains need not translate into an untenable FP-driven workload when AI is implemented as a triage or assisted-reading tool with clinician adjudication 4, 5. The extent of any FP burden is likely to be system- and threshold-dependent; future studies should therefore routinely report workload metrics alongside diagnostic performance and explore threshold calibration to balance detection benefits against operational efficiency.

Third, we concur that substantial heterogeneity across the included studies limits the interpretability of a single pooled estimate. We agree that diagnostic test accuracy meta-analyses are ideally synthesized using hierarchical models (bivariate random-effects or hierarchical summary receiver operating characteristic, HSROC), which jointly analyze sensitivity and specificity and can accommodate correlation and threshold effects across studies. This approach is recommended in contemporary diagnostic test accuracy methodology guidance and reporting standards. In future updates, particularly as more studies provide complete 2 × 2 data, we plan to implement a hierarchical bivariate/HSROC synthesis alongside any conventional random-effects pooling.

We appreciate the opportunity to clarify these points and the corresponding interpretive statements so that the record is accurate.

The authors declare no conflicts of interest.
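
For reference, the bivariate random-effects model mentioned under the third point can be sketched as follows. This is the commonly used binomial-normal formulation, given only as an illustration and not as a description of any analysis reported in this letter. The symbols are notation introduced here: for study i, TP, FN, TN and FP are the cells of its 2 × 2 table, Se and Sp are its true sensitivity and specificity, the means are the pooled values on the logit scale, and rho is the between-study correlation that carries the threshold effect.

```latex
% Within-study level: exact binomial likelihoods for each study i
\begin{align}
  TP_i &\sim \operatorname{Binomial}\bigl(TP_i + FN_i,\; Se_i\bigr), \\
  TN_i &\sim \operatorname{Binomial}\bigl(TN_i + FP_i,\; Sp_i\bigr).
\end{align}

% Between-study level: logit sensitivity and specificity are jointly normal
\begin{equation}
  \begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
  \sim \mathcal{N}\!\left(
    \begin{pmatrix} \mu_{Se} \\ \mu_{Sp} \end{pmatrix},\;
    \begin{pmatrix}
      \sigma_{Se}^{2} & \rho\,\sigma_{Se}\sigma_{Sp} \\
      \rho\,\sigma_{Se}\sigma_{Sp} & \sigma_{Sp}^{2}
    \end{pmatrix}
  \right).
\end{equation}

% Summary estimates are obtained by back-transforming the pooled logits
\begin{equation}
  \widehat{Se} = \frac{1}{1 + e^{-\hat{\mu}_{Se}}}, \qquad
  \widehat{Sp} = \frac{1}{1 + e^{-\hat{\mu}_{Sp}}}.
\end{equation}
```

Because sensitivity and specificity are modelled jointly, a negative rho (as expected when studies use different positivity thresholds) is absorbed into the covariance term instead of inflating the apparent heterogeneity of each measure pooled separately.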
Dhali et al. (Fri,) studied this question.