Abstract

We introduce and evaluate Echo-Vision-FM (Echocardiogram Video Vision Foundation Model), a self-supervised video learning framework designed to pre-train a video encoder on large-scale, unlabeled echocardiogram data. The framework aims to generate robust, transferable video representations that enhance downstream performance across diverse echocardiogram datasets and clinical scenarios. Leveraging the publicly available MIMIC-IV-ECHO dataset, we employ an advanced masked auto-encoding strategy with an 85% mask ratio to pre-train our echo-video encoder without requiring manual annotations. To further improve the learned representations, we introduce the Spatial-Temporal Fusion Network (STF-Net), which integrates spatial and temporal correlations from the learned video representations through dual pathways that process joint and disjoint space-time features. Echo-Vision-FM demonstrated outstanding performance in heart function diagnosis, achieving an accuracy of 0.905, an F1 score of 0.941, and an Area Under the Curve (AUC) of 0.931 on the EchoNet-Dynamic dataset, while reaching an AUC of 0.849 for aortic stenosis diagnosis on the TMED dataset. In cardiac morphological value estimation, Echo-Vision-FM outperformed state-of-the-art models, achieving a mean absolute error of 3.87% and an r² value of 0.825 in left ventricular ejection fraction (LVEF) prediction on the EchoNet-Dynamic dataset. The model also showed substantial improvements in estimating end-systolic and end-diastolic volumes, with r² values of 0.782 and 0.742, respectively. On the CAMUS dataset, our end-to-end approach achieved a Pearson correlation coefficient of 86.49% for LVEF estimation, significantly outperforming traditional segmentation-based methods while eliminating the need for intermediate post-processing steps. Incorporating STF-Net resulted in further performance improvements across all tasks, with consistent gains observed when combined with both randomly initialized and pre-trained encoders.
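For readers less familiar with the regression metrics quoted above (mean absolute error, r², and the Pearson correlation coefficient), the following minimal sketch shows how they are commonly computed. It assumes r² denotes the coefficient of determination, which the abstract does not state explicitly. The LVEF values are made-up toy numbers, not data from any of the datasets named here.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed meaning of r²)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Toy LVEF values in percent (illustrative only, not from the paper).
lvef_true = [55.0, 60.0, 35.0, 70.0]
lvef_pred = [57.0, 58.0, 40.0, 66.0]

mae = mean_absolute_error(lvef_true, lvef_pred)
r2 = r2_score(lvef_true, lvef_pred)
r = pearson_r(lvef_true, lvef_pred)
print(f"MAE={mae:.2f}  r2={r2:.3f}  Pearson r={r:.3f}")
```

Because LVEF is itself expressed in percent, a mean absolute error on LVEF is naturally reported in percentage points.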
Echo-Vision-FM provides a powerful, scalable approach to echocardiogram analysis, with significant potential for clinical diagnostics and research, demonstrating robust cross-institutional generalizability and data efficiency in low-resource settings.
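The masked auto-encoding pre-training mentioned in the abstract hides most of the video tokens from the encoder. A minimal sketch of selecting tokens at an 85% mask ratio follows. It assumes uniform random masking over space-time tokens, and the clip and tubelet sizes are hypothetical, since the abstract specifies neither the masking pattern nor the tokenization. The function name `random_token_mask` is likewise illustrative.

```python
import random

def random_token_mask(num_tokens, mask_ratio=0.85, seed=None):
    """Split token indices into (visible, masked) at the given mask ratio.

    Assumption: tokens are masked uniformly at random; the actual
    framework may use a different pattern (for example, tube masking).
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    perm = list(range(num_tokens))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    visible = sorted(perm[num_masked:])
    return visible, masked

# Hypothetical clip: 16 frames of 224x224 pixels, split into 2x16x16 tubelets.
frames, height, width = 16, 224, 224
tube_t, tube_h, tube_w = 2, 16, 16
num_tokens = (frames // tube_t) * (height // tube_h) * (width // tube_w)

visible, masked = random_token_mask(num_tokens, mask_ratio=0.85, seed=0)
print(num_tokens, len(visible), len(masked))
```

In a masked auto-encoder, only the visible tokens are passed to the encoder, and a lightweight decoder is trained to reconstruct the masked ones, which is why a high mask ratio also reduces pre-training cost.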
Zhang et al. studied this question.