What question did this study set out to answer?

The aim is to evaluate deep learning segmentation models using breast cancer whole-slide images and address methodological issues.

June 19, 2026 | Open Access

Methodological considerations for evaluating deep learning segmentation models in digital pathology whole-slide images.

Key Points

The aim is to evaluate deep learning segmentation models using breast cancer whole-slide images and address methodological issues.
Assessment of deep learning (DL) models for segmenting tumoral and stromal regions.
Examination of how color normalization affects model performance when training and testing data come from different sources.
Introduction of three methods for aggregating performance by unit of analysis (pixels, ROIs, slides), with a bootstrap method for estimating variance at the slide level.
Different units of analysis yield different mean performance estimates and different levels of uncertainty.
Color normalization significantly improved model performance when the training and testing data came from different sources.

Abstract

Purpose: Automated whole-slide image (WSI) analysis, specifically applications of deep learning (DL)-based algorithms, has been enabling automated detection, classification, segmentation, and prognosis for various diseases. Performance evaluation plays an important role in the success of these complex big-data-based technologies. Our purpose is to conduct a performance evaluation of DL segmentation models applied to a breast cancer WSI dataset provided by the Tumor InfiltratinG lymphocytes in breast cancER challenge and investigate methodological issues in the assessment of WSI segmentation models.

Approach: We evaluated the performance of DL models in the segmentation of tumoral and stromal regions and the effect of color normalization on improving the performance of these models when the training and testing data are from different sources. One important issue is the aggregation of image segmentation performance when the reference standard includes annotations only from selected regions of interest (ROIs). We introduced three different methods for aggregating performance based on different units of analysis (pixels, ROIs, and slides) and a bootstrap method to estimate the variance of the performance results at the slide level.

Results: We found that using different units of analysis can produce not just different mean performance estimates but also different levels of uncertainty. Our results also showed that color normalization significantly improved DL model performance when the training and testing data are from different sources.

Conclusions: Our study demonstrates the importance of image acquisition, study design, and statistical analysis methods used in the performance evaluation of computational pathology applications.
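
The choice of unit of analysis described in the abstract can be illustrated with a small sketch. This summary does not give the study's actual metric, aggregation formulas, or bootstrap procedure, so everything below is an assumption made for illustration: the Dice coefficient as the metric, the hypothetical `ROIS` counts, the function names, and a bootstrap that resamples whole slides with replacement. It is one plausible reading of the approach, not the authors' implementation.

```python
import random
from statistics import mean, stdev

# Hypothetical per-ROI pixel counts (illustrative only, not from the study):
# (slide_id, true_positive, false_positive, false_negative)
ROIS = [
    ("slide_A", 900, 50, 50),
    ("slide_A", 100, 40, 60),
    ("slide_B", 300, 100, 100),
    ("slide_C", 50, 25, 25),
    ("slide_C", 400, 50, 150),
    ("slide_C", 200, 0, 100),
]


def dice(tp, fp, fn):
    """Dice coefficient from pixel counts (assumed metric)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def pixel_level(rois):
    """Pool all pixels across ROIs, then compute a single Dice."""
    tp = sum(r[1] for r in rois)
    fp = sum(r[2] for r in rois)
    fn = sum(r[3] for r in rois)
    return dice(tp, fp, fn)


def roi_level(rois):
    """Compute Dice per ROI, then average with equal weight per ROI."""
    return mean(dice(tp, fp, fn) for _, tp, fp, fn in rois)


def slide_level(rois):
    """Average ROI Dice within each slide, then average across slides."""
    by_slide = {}
    for slide_id, tp, fp, fn in rois:
        by_slide.setdefault(slide_id, []).append(dice(tp, fp, fn))
    return mean(mean(scores) for scores in by_slide.values())


def bootstrap_slide_se(rois, n_boot=1000, seed=0):
    """Standard error of the slide-level estimate, resampling whole slides."""
    rng = random.Random(seed)
    slide_ids = sorted({r[0] for r in rois})
    grouped = {s: [r for r in rois if r[0] == s] for s in slide_ids}
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(slide_ids, k=len(slide_ids))
        # Relabel draws so a slide drawn twice counts as two slides.
        resampled = [
            (str(i), tp, fp, fn)
            for i, s in enumerate(sample)
            for _, tp, fp, fn in grouped[s]
        ]
        estimates.append(slide_level(resampled))
    return stdev(estimates)


if __name__ == "__main__":
    print("pixel-level Dice:", round(pixel_level(ROIS), 4))
    print("ROI-level Dice:  ", round(roi_level(ROIS), 4))
    print("slide-level Dice:", round(slide_level(ROIS), 4))
    print("slide-level bootstrap SE:", round(bootstrap_slide_se(ROIS), 4))
```

In this sketch, pixel-level pooling lets large ROIs dominate the estimate, while ROI-level and slide-level averaging weight each ROI or each slide equally, so the three estimates generally differ. Resampling at the slide level keeps ROIs from the same slide together, which matters if ROIs within a slide are correlated.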
