Ultrasound imaging is a diagnostic modality that provides real-time, radiation-free evaluation across many clinical areas. However, noise, operator dependence, and a restricted field of view make ultrasound images difficult to interpret, leading to inter-observer variability. Because labelled datasets are scarce and there is a domain gap between general and sonographic images, deep learning models pre-trained on non-medical data transfer poorly to ultrasound. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pre-trained exclusively on ultrasound data. The model was pre-trained on ∼370,000 2D and 3D ultrasound images from 46 open-source datasets (OpenUS-46), covering over 20 anatomical regions. This curated dataset has been made publicly available. Using an encoder–decoder architecture, USF-MAE reconstructs masked image patches, which lets it learn representations directly from unlabelled data. The pre-trained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA, MMOTU-2D, and GIST514-DB. USF-MAE outperformed CNN and ViT baselines on all three tasks, attaining F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite using no labels during pre-training, USF-MAE approached the supervised foundation model UltraSam on breast cancer classification and outperformed it on the other tasks, demonstrating cross-anatomical generalization. In addition, USF-MAE showed strong performance on ovarian tumour segmentation using the MMOTU-2D dataset, achieving an mAP of 51.0% and an mAP@50 of 77.9%. These findings establish USF-MAE as a scalable and label-efficient ultrasound foundation model. Because it can be continually pre-trained on future unlabelled public or institutional datasets without human annotation, its representation learning approach supports data-efficient clinical and research applications.
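The masked-patch reconstruction objective mentioned above can be illustrated with a minimal sketch. It is not the USF-MAE implementation: the 224×224 input size, 16×16 patch size, and 75% mask ratio are assumptions borrowed from the original MAE setup, not values stated here, and the helper names `patch_grid`, `random_mask`, and `masked_mse` are hypothetical. The sketch only shows how patches might be split into visible and masked sets and how a reconstruction loss could be restricted to the masked ones.

```python
import random


def patch_grid(height, width, patch):
    """Return the number of non-overlapping patches along each axis."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return height // patch, width // patch


def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into sorted (visible, masked) lists."""
    num_masked = int(num_patches * mask_ratio)
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    masked = sorted(shuffled[:num_masked])
    visible = sorted(shuffled[num_masked:])
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error over masked patches only.

    `target` and `prediction` are lists of patches; each patch is a flat
    list of pixel values.
    """
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Assumed settings (not taken from the abstract): 224x224 input,
# 16x16 patches, 75% of patches masked.
rows, cols = patch_grid(224, 224, 16)
visible, masked = random_mask(rows * cols, 0.75, random.Random(0))
```

In a setup like this, only the visible patches would be passed to the encoder, and the decoder would be asked to predict the pixel values of the masked ones, so no labels are needed at any point.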
Megahed et al. studied this question.