Abstract

Background: MRI-based tumor segmentation could greatly support clinical assessment of diffuse midline glioma (DMG), yet clinical translation of automated methods remains constrained by occasional model failures. Both the performance required for clinical utility and the value of uncertainty estimates for detecting meaningful errors remain unclear. We systematically evaluate segmentation performance prediction, response label stability, and uncertainty estimation.

Methods: Whole tumor was segmented in a multicentric, international cohort of pre- and post-therapy multi-contrast MRIs (n = 403) of 107 DMG patients. Segmentations by a state-of-the-art deep learning model were dichotomized by Dice score into acceptable (Dice ≥ 0.8) and poor (Dice < 0.8). We analyzed how well segmentation performance could be classified from image-derived features (imaging metadata, radiomic features, 3D brain MRI foundation model embeddings), and compared response assessments derived from manual vs. automated segmentations (n = 51 patients with longitudinal follow-up). In an eye-tracking sub-study, we further quantified the contour uncertainty of human annotators (36 annotators, 12 slices) and contextualized it with observer gaze patterns.

Results: Despite generally good performance (median Dice = 0.77–0.81), auto-segmented volumes altered 20% of trajectory-based manual response labels (n = 10), predominantly misclassifying stable/progressive disease as partial response due to undersegmentation of post-treatment scans. Segmentation performance was best classified using a combination of whole-image foundation model embeddings and segmented tumor volume (ROC AUC = 0.81 ± 0.05). Segmentation error correlated (|r| = 0.9) with human contour uncertainty, supporting model-based uncertainty as a proxy for annotation difficulty. Image-derived attention features from deeper encoder layers explained substantially more uncertainty variance than eye-tracking features alone (R²: 24% vs. 2%). Human gaze attention overlapped most with U-Net bottleneck activations (Dice = 0.6). A combined model integrating model attention and human visual behavior explained 39% of uncertainty variance.

Conclusions: Jointly, these results support the integration of performance- and uncertainty-aware segmentation frameworks to enable safe clinical deployment, scalable quality assurance, and reliable endpoint extraction from automated tumor segmentations in DMG.
Laslo et al. (Tue,) studied this question.
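As an illustration of the dichotomization described in the Methods, the following is a minimal sketch of how a Dice score between an automated and a manual binary mask might be computed and thresholded at 0.8. It is not the authors' implementation: the function names `dice_score` and `label_quality`, the convention of returning 1.0 when both masks are empty, and the treatment of a Dice of exactly 0.8 as acceptable are all assumptions made for this example.

```python
import numpy as np


def dice_score(pred, ref):
    """Dice overlap between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    overlap = np.logical_and(pred, ref).sum()
    return float(2.0 * overlap / denom)


def label_quality(dice, threshold=0.8):
    """Dichotomize a Dice score (assumption: the threshold itself is acceptable)."""
    return "acceptable" if dice >= threshold else "poor"


if __name__ == "__main__":
    # Toy 2D example: the manual mask covers 4 pixels, the automated mask 6,
    # and the two share 4 pixels.
    manual = np.zeros((4, 4), dtype=bool)
    manual[1:3, 1:3] = True
    auto = np.zeros((4, 4), dtype=bool)
    auto[1:3, 1:4] = True
    d = dice_score(auto, manual)
    print(f"Dice = {d:.2f} -> {label_quality(d)}")
```

The same functions apply unchanged to 3D volumes, since the sums run over all array elements regardless of dimensionality.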