What question did this study set out to answer?

To evaluate how dataset size affects the performance of foundation segmentation models, SAM and MedSAM, in neuroanatomic segmentation compared to standard models.

February 19, 2026

Impact of dataset size on fine‐tuning foundation models for neuroanatomic segmentation: Testing the foundation model hypothesis

Key Points

Analyzed 1,113 T1-weighted 3D MRIs from the Human Connectome Project.
Reference segmentations of 93 gray and white matter regions were generated with Freesurfer and manually refined.
Models were fine-tuned and evaluated using Dice scores across varying dataset sizes, from full data to zero-shot conditions.
With the full training set, UNet outperformed MedSAM and SAM across most regions, with median Dice scores of 0.88 vs. 0.82 and 0.84, respectively (p < 0.001).
As the training set shrank, UNet performed as well as or better than MedSAM in the three studied regions, down to a single training MRI.
In zero-shot conditions, SAM and MedSAM achieved median Dice scores of 0.66 and 0.59, respectively.
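
The Dice scores quoted above measure the overlap between a predicted region and its reference segmentation: twice the number of shared voxels divided by the total number of voxels in the two regions. The following is a minimal sketch of that calculation, not the study's evaluation code. It assumes label maps flattened into plain Python lists of integer region labels, and the names `dice_score` and `median_dice` are hypothetical.

```python
from statistics import median


def dice_score(pred, truth, label):
    """Dice overlap for one region label between two equally sized label maps."""
    if len(pred) != len(truth):
        raise ValueError("label maps must contain the same number of voxels")
    pred_count = sum(1 for p in pred if p == label)
    truth_count = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_count + truth_count == 0:
        return None  # label absent from both maps, so Dice is undefined
    return 2.0 * overlap / (pred_count + truth_count)


def median_dice(pred, truth, labels):
    """Median of the per-label Dice scores, skipping undefined labels."""
    scores = [dice_score(pred, truth, label) for label in labels]
    return median(s for s in scores if s is not None)


# Toy example with two region labels (1 and 2) plus background (0).
truth = [0, 1, 1, 2, 2, 2]
pred = [0, 1, 2, 2, 2, 0]
per_label = {label: dice_score(pred, truth, label) for label in (1, 2)}
overall = median_dice(pred, truth, labels=(1, 2))
```

A score of 1.0 means perfect overlap and 0.0 means none, so the reported gap between 0.88 and 0.82 reflects a consistent difference in voxel-level agreement across regions.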

Abstract

Background: Foundation models have shown remarkable potential in medical imaging by leveraging extensive pretraining on general datasets to enable fine‐tuning for specific tasks. This is thought to be particularly beneficial for tasks where annotated data is scarce. A key underlying assumption, however, is that these models can learn from small amounts of training data more efficiently than existing state‐of‐the‐art models.

Purpose: This study aims to characterize the performance of two major foundation segmentation models (SAM and MedSAM) when fine‐tuned to segment neuroanatomic structures across a spectrum of dataset sizes, compared to a standard fully‐supervised UNet model.

Methods: This study used 1,113 T1‐weighted 3D MRIs from the Human Connectome Project's Young Adult cohort with corresponding Freesurfer‐generated, manually‐refined segmentations of 93 gray and white matter regions. The dataset was divided into 891 (80%) training MRIs, 111 (10%) validation MRIs, and 111 (10%) testing MRIs. SAM and MedSAM models were first fine‐tuned and compared against a standard UNet model using Dice score to establish the baseline performance using all training 3D volumes. Subsequently, MedSAM and UNet models were fine‐tuned across a varying number of training volumes to assess performance with diminishing dataset size, down to a single MRI, as well as no MRIs (zero‐shot) for the MedSAM and SAM models.

Results: Using the entire training set, UNet outperformed MedSAM and SAM across most regions, with median Dice scores of 0.88 versus 0.82 and 0.84, respectively (p < 0.001). With diminishing dataset size, UNet continued to perform as well as or better than MedSAM in the three studied regions, down to even a single 3D volume. In the zero‐shot setting, SAM and MedSAM showed some ability to segment with overall median Dice scores of 0.66 and 0.59, respectively.

Conclusions: SAM and MedSAM did not outperform a standard UNet model in segmentation tasks, even in extremely limited training data settings, contrary to the foundation model hypothesis, suggesting that foundation models do not necessarily yield superior fine‐tuned performance compared to standard segmentation models in the low data setting. Instead, the potential benefit of foundation models will depend on the characteristics of the task at hand and the behavior and capacity of the specific foundation model in question. Thus, it will be essential to benchmark against standard supervised deep learning methods for each distinct application to demonstrate the added value of using a foundation model.
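
The split sizes reported in the Methods (891 training, 111 validation, 111 testing out of 1,113 MRIs) and the idea of shrinking the training set can be illustrated with a short sketch. This is not the authors' pipeline. It assumes the validation and test sets each take the floor of 10% of the subjects and the training set takes the remainder, which happens to reproduce the reported counts. The function names, the random seed, and the subset sizes other than 891 and 1 are hypothetical.

```python
import random


def split_ids(ids, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle subject IDs and split them into train, validation, and test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_val = int(len(ids) * val_fraction)
    n_test = int(len(ids) * test_fraction)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]  # the remainder goes to training
    return train, val, test


def nested_subsets(train_ids, sizes):
    """Return progressively smaller training sets, each contained in the larger ones."""
    return {n: train_ids[:n] for n in sizes}


train, val, test = split_ids(range(1113))
# Hypothetical size schedule; the abstract only states the endpoints
# (all training volumes and a single MRI).
subsets = nested_subsets(train, sizes=(891, 100, 10, 1))
```

Using nested subsets keeps the validation and test sets fixed while only the amount of training data changes, so differences in Dice score can be attributed to dataset size rather than to a different choice of subjects.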


Cite This Study

Nair et al. studied this question.

synapsesocial.com/papers/6996a898ecb39a600b3ef890 https://doi.org/10.1002/mp.70337
