Abstract
Background Foundation models have shown remarkable potential in medical imaging by leveraging extensive pretraining on general datasets to enable fine‐tuning for specific tasks. This is thought to be particularly beneficial for tasks where annotated data is scarce. A key underlying assumption, however, is that these models can learn from small amounts of training data more efficiently than existing state‐of‐the‐art models.
Purpose This study aims to characterize the performance of two major foundation segmentation models (SAM and MedSAM) when fine‐tuned to segment neuroanatomic structures across a spectrum of dataset sizes, compared to a standard fully‐supervised UNet model.
Methods This study used 1,113 T1‐weighted 3D MRIs from the Human Connectome Project's Young Adult cohort with corresponding Freesurfer‐generated, manually‐refined segmentations of 93 gray and white matter regions. The dataset was divided into 891 (80%) training MRIs, 111 (10%) validation MRIs, and 111 (10%) testing MRIs. SAM and MedSAM models were first fine‐tuned and compared against a standard UNet model using Dice score to establish the baseline performance using all training 3D volumes. Subsequently, MedSAM and UNet models were fine‐tuned across a varying number of training volumes to assess performance with diminishing dataset size, down to a single MRI, as well as no MRIs (zero‐shot) for the MedSAM and SAM models.
Results Using the entire training set, UNet outperformed MedSAM and SAM across most regions, with median Dice scores of 0.88 versus 0.82 and 0.84, respectively (p < 0.001). With diminishing dataset size, UNet continued to perform as well as or better than MedSAM in the three studied regions, down to even a single 3D volume. In the zero‐shot setting, SAM and MedSAM showed some ability to segment with overall median Dice scores of 0.66 and 0.59, respectively.
Conclusions SAM and MedSAM did not outperform a standard UNet model in segmentation tasks, even in extremely limited training data settings, contrary to the foundation model hypothesis. This suggests that foundation models do not necessarily yield superior fine‐tuned performance compared to standard segmentation models in the low‐data setting. Instead, the potential benefit of foundation models will depend on the characteristics of the task at hand and the behavior and capacity of the specific foundation model in question. Thus, it will be essential to benchmark against standard supervised deep learning methods for each distinct application to demonstrate the added value of using a foundation model.
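For readers unfamiliar with the evaluation metric, the Dice score for one region is twice the number of voxels that the prediction and the reference both assign to that region, divided by the sum of the two region sizes. The following is a minimal sketch of that computation, not the study's evaluation code: the function names `dice_score` and `median_dice`, the flat-list representation of a label map, and the toy data are all assumptions made for illustration. The abstract also does not state how the median was aggregated, so taking the median across regions here is likewise an assumption.

```python
from statistics import median


def dice_score(pred, truth, label):
    """Dice overlap for one label between two equal-length flat label maps.

    Hypothetical helper for illustration; returns None when the label is
    absent from both maps, since the score is undefined in that case.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of voxels")
    pred_count = sum(1 for p in pred if p == label)
    truth_count = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    denom = pred_count + truth_count
    if denom == 0:
        return None
    return 2.0 * overlap / denom


def median_dice(pred, truth, labels):
    """Median Dice across the given labels, skipping undefined scores.

    Aggregating across regions is an assumption, not the paper's stated method.
    """
    scores = [dice_score(pred, truth, label) for label in labels]
    return median(s for s in scores if s is not None)


# Toy flattened label maps (0 = background, 1 and 2 = two regions).
truth = [0, 1, 1, 2, 2, 2, 0, 1]
pred = [0, 1, 2, 2, 2, 2, 0, 0]

for region in (1, 2):
    print(region, dice_score(pred, truth, region))
print("median", median_dice(pred, truth, [1, 2]))
```

In the toy data, region 1 is under-segmented (one of three reference voxels recovered) and region 2 is slightly over-segmented, which shows how Dice penalizes both missed and spurious voxels.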
Nair et al. (Sun,) studied this question.