Purpose: Automated whole-slide image (WSI) analysis, specifically applications of deep learning (DL)-based algorithms, has been enabling automated detection, classification, segmentation, and prognosis for various diseases. Performance evaluation plays an important role in the success of these complex big-data-based technologies. Our purpose is to conduct a performance evaluation of DL segmentation models applied to a breast cancer WSI dataset provided by the Tumor InfiltratinG lymphocytes in breast cancER challenge and investigate methodological issues in the assessment of WSI segmentation models.

Approach: We evaluated the performance of DL models in the segmentation of tumoral and stromal regions and the effect of color normalization on improving the performance of these models when the training and testing data are from different sources. One important issue is the aggregation of image segmentation performance when the reference standard includes annotations only from selected regions of interest (ROIs). We introduced three different methods for aggregating performance based on different units of analysis (pixels, ROIs, and slides) and a bootstrap method to estimate the variance of the performance results at the slide level.

Results: We found that using different units of analysis can produce not just different mean performance estimates but also different levels of uncertainty. Our results also showed that color normalization significantly improved DL model performance when the training and testing data are from different sources.

Conclusions: Our study demonstrates the importance of image acquisition, study design, and statistical analysis methods used in the performance evaluation of computational pathology applications.
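
To make the three units of analysis concrete, the following is a minimal sketch of how pixel-, ROI-, and slide-level aggregation can give different mean estimates, with a slide-level bootstrap for the standard error. It is an illustration only, not the authors' implementation: the Dice coefficient as the metric, the per-ROI pixel counts in `ROIS`, the unweighted averaging of ROIs within a slide, and all function names are assumptions introduced here.

```python
import random
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical per-ROI pixel counts: (slide_id, true_pos, false_pos, false_neg).
# These numbers are invented for illustration and are not from the study.
ROIS = [
    ("slide_A", 90, 10, 10),
    ("slide_A", 40, 20, 0),
    ("slide_B", 10, 5, 5),
    ("slide_C", 300, 100, 100),
]


def dice(tp, fp, fn):
    """Dice coefficient from pixel counts (assumed metric)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def pixel_level(rois):
    """Pool all pixels from all ROIs, then compute one score."""
    tp = sum(r[1] for r in rois)
    fp = sum(r[2] for r in rois)
    fn = sum(r[3] for r in rois)
    return dice(tp, fp, fn)


def roi_level(rois):
    """Score each ROI separately, then average over ROIs."""
    return mean(dice(tp, fp, fn) for _, tp, fp, fn in rois)


def per_slide_scores(rois):
    """Average the ROI scores within each slide (unweighted, an assumption)."""
    by_slide = defaultdict(list)
    for slide_id, tp, fp, fn in rois:
        by_slide[slide_id].append(dice(tp, fp, fn))
    return {slide: mean(scores) for slide, scores in by_slide.items()}


def slide_level(rois):
    """Average the per-slide scores over slides."""
    return mean(per_slide_scores(rois).values())


def bootstrap_slide_se(rois, n_boot=1000, seed=0):
    """Resample slides with replacement and return the standard
    deviation of the resampled slide-level means."""
    rng = random.Random(seed)
    scores = list(per_slide_scores(rois).values())
    estimates = [
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    ]
    return stdev(estimates)


if __name__ == "__main__":
    print("pixel-level:", round(pixel_level(ROIS), 4))
    print("ROI-level:  ", round(roi_level(ROIS), 4))
    print("slide-level:", round(slide_level(ROIS), 4))
    print("slide-level bootstrap SE:", round(bootstrap_slide_se(ROIS), 4))
```

Pixel-level pooling lets large ROIs dominate, ROI-level averaging lets slides with many ROIs dominate, and slide-level averaging weights each slide equally. Resampling slides rather than ROIs or pixels treats the slide as the independent unit, which is one common way to account for the correlation among ROIs drawn from the same slide.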
Arab et al. (Fri,) studied this question.