Reliable uncertainty quantification is essential for deploying image segmentation models in systems where inter-rater variability among experts is significant and must be accounted for to ensure dependable performance. A key unresolved question is whether multiple annotations per case are required during training to obtain robust uncertainty estimates. In this work, we provide analytical and empirical evidence addressing this question. Using nine diverse publicly available datasets and the nnU-Net framework extended with Ensemble, Bayesian, Probabilistic, and Hierarchical Probabilistic models, we systematically compare training with a single annotation per case against training with multiple annotations per case. Uncertainty was assessed using two complementary approaches: probability maps and disagreement-as-class. Performance was measured with established metrics, including A-Dice, GED, and Dice. Results show that ensemble-based probability-map approaches consistently outperform the other methods and achieve comparable performance in the single- and multi-annotation settings. In contrast, for the disagreement-as-class approach, the multi-annotation setting provides significant advantages over the single-annotation setting in capturing inter-expert variability, particularly in uncertainty-class segmentation. The numerical findings are supported by our analytical arguments. Taken together, the results indicate that multiple annotations per case may not be strictly necessary for training effective uncertainty-aware segmentation models, which has practical implications for reducing annotation costs and enabling scalable development of reliable uncertainty-aware systems.
Jalal et al. (Tue,) studied this question.
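To make the two uncertainty targets named in the abstract concrete, the following is a minimal sketch of how a probability map and a disagreement-as-class label could be derived from several binary annotations of the same image, together with the standard Dice coefficient. The function names, the nested-list mask representation, and the label encoding (0 = agreed background, 1 = agreed foreground, 2 = disagreement) are illustrative assumptions and are not taken from the paper; A-Dice and GED are not attempted here.

```python
def dice(a, b):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    intersection = sum(x * y for x, y in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_probability_map(masks):
    """Pixel-wise mean over several binary masks of identical shape."""
    n = len(masks)
    return [[sum(votes) / n for votes in zip(*rows)] for rows in zip(*masks)]


def disagreement_as_class(masks):
    """Hypothetical encoding: 0 = all background, 1 = all foreground, 2 = disagreement."""
    labels = []
    for rows in zip(*masks):
        label_row = []
        for votes in zip(*rows):
            if all(votes):
                label_row.append(1)
            elif not any(votes):
                label_row.append(0)
            else:
                label_row.append(2)
        labels.append(label_row)
    return labels


# Three toy annotations of a 2x3 image (made-up values for illustration only).
rater_1 = [[0, 1, 1], [0, 1, 0]]
rater_2 = [[0, 1, 0], [0, 1, 1]]
rater_3 = [[0, 1, 1], [0, 0, 0]]
annotations = [rater_1, rater_2, rater_3]

prob_map = mean_probability_map(annotations)
label_map = disagreement_as_class(annotations)
pairwise_dice = dice(rater_1, rater_2)
```

With a single annotation per case, `disagreement_as_class` can never emit class 2, which gives one intuition for why that approach may benefit more from multiple annotations than a probability map averaged over ensemble members.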