What question did this study set out to answer?

Evaluate the effectiveness of deep learning models for segmenting white matter hyperintensities in heterogeneous MRI scans.

February 2, 2026 | Open Access

Benchmark White Matter Hyperintensity Segmentation Methods Fail on Heterogeneous Clinical MRI: A New Dataset and Deep Learning–Based Solutions

Key Points

Introduced a new clinical WMH dataset from 195 brain MRI scans across 71 scanners.
Evaluated two deep learning models: nnU-Net and a fine-tuned version of MedSAM.
WMHs were manually annotated by an experienced rater under neuroradiologist supervision to provide reference annotations for performance validation.
Benchmark methods showed poor generalization on diverse MRIs, missing small lesions and creating false positives.
Robust-WMH-UNet achieved a median Dice similarity coefficient (DSC) of 0.768 with improved specificity.
Robust-WMH-SAM reached competitive performance with a median DSC of up to 0.750 after limited training.

Abstract

Existing automated methods for white matter hyperintensity (WMH) segmentation often generalize poorly to heterogeneous clinical MRI due to variability in scanner types, field strengths, and protocols. To address this challenge, we introduce a diverse clinical WMH dataset and evaluate two deep learning–based solutions: an nnU-Net model trained directly on the data and a foundation model adapted through fine-tuning. This retrospective study included 195 routine brain MRI scans acquired from 71 scanners between June 2006 and October 2022. Participants ranged in age from 46 to 87 years (median, 70 years; 94 females). WMHs were manually annotated by an experienced rater and reviewed under neuroradiologist supervision. Several benchmark segmentation methods were evaluated against these annotations. We then developed Robust-WMH-UNet by training nnU-Net on the dataset and Robust-WMH-SAM by fine-tuning MedSAM, a vision foundation model. Benchmark methods demonstrated poor generalization, frequently missing small lesions and producing false positives in anatomically complex regions such as the septum pellucidum. Robust-WMH-UNet achieved superior accuracy (median Dice similarity coefficient [DSC], 0.768) with improved specificity, while Robust-WMH-SAM attained competitive performance (median DSC up to 0.750) after only limited training, reaching acceptable accuracy within a single epoch. This new clinically representative dataset provides a strong foundation for developing robust WMH algorithms, enabling fair cross-method comparisons, and supporting the translation of segmentation models into routine clinical practice.
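For readers unfamiliar with the headline metric, the following is a minimal sketch of how a Dice similarity coefficient and its median across scans could be computed. It is not the authors' evaluation code: the function names `dice_coefficient` and `median_dice`, the flat 0/1 mask representation, and the convention of scoring two empty masks as 1.0 are all assumptions made for illustration.

```python
from statistics import median
from typing import Iterable, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = WMH voxel (assumed layout)


def dice_coefficient(pred: Mask, ref: Mask) -> float:
    """Dice similarity coefficient: 2*|A ∩ B| / (|A| + |B|)."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as lesion in both the prediction and the reference.
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total lesion voxels in each mask, summed.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * overlap / total


def median_dice(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Median DSC over (prediction, reference) pairs, one pair per scan."""
    return median(dice_coefficient(p, r) for p, r in pairs)


if __name__ == "__main__":
    # Toy masks only; these are not data from the study.
    scans = [
        ([1, 1, 0, 0], [1, 0, 0, 0]),  # partial overlap
        ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect overlap
        ([0, 0, 1, 1], [1, 1, 0, 0]),  # no overlap
    ]
    print(median_dice(scans))
```

Reporting the median rather than the mean keeps a few badly segmented scans from dominating the summary, which matters when lesion load varies widely between patients.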
