What question did this study set out to answer?

Evaluate Echo-Vision-FM for robust echocardiogram analysis and diagnosis enhancement.

December 11, 2025 | Open Access

Echo-Vision-FM: a pre-training and fine-tuning framework for echocardiogram video vision foundation model

Key Points

Developed a self-supervised video learning framework using large-scale echocardiogram data.
Applied a masked auto-encoding strategy (85% mask ratio) for pre-training without manual annotations.
Introduced the Spatial-Temporal Fusion Network (STF-Net) to capture spatial and temporal correlations.
Achieved 0.905 accuracy and a 0.941 F1 score for heart function diagnosis on the EchoNet-Dynamic dataset.
Obtained an AUC of 0.931 for heart function diagnosis (EchoNet-Dynamic) and 0.849 for aortic stenosis diagnosis (TMED).
Reduced mean absolute error to 3.87% for left ventricular ejection fraction estimation.
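
For readers less familiar with the metrics quoted in the key points above, the following is a minimal sketch of how accuracy, F1 score, and mean absolute error are commonly computed. It is not the authors' evaluation code. The function names, the choice of label 1 as the positive class, and the toy labels and ejection-fraction values are all assumptions made for illustration; none of the numbers come from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 = 2*TP / (2*TP + FP + FN); `positive` is an assumed label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same units as the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, invented for this example only.
labels_true = [1, 1, 1, 0, 0, 1]
labels_pred = [1, 1, 0, 0, 1, 1]
ef_true = [55.0, 60.0, 35.0]   # hypothetical ejection fractions, in percent
ef_pred = [52.0, 63.0, 39.0]

print("accuracy:", accuracy(labels_true, labels_pred))
print("F1:", f1_score(labels_true, labels_pred))
print("MAE (percentage points):", mean_absolute_error(ef_true, ef_pred))
```

Because ejection fraction is itself expressed in percent, a mean absolute error reported as a percentage is naturally read as percentage points of ejection fraction under this definition.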

Abstract

We introduce and evaluate Echo-Vision-FM (Echocardiogram Video Vision Foundation Model), a self-supervised video learning framework designed to pre-train a video encoder on large-scale, unlabeled echocardiogram data. The framework aims to generate robust, transferable video representations that enhance downstream performance across diverse echocardiogram datasets and clinical scenarios. Leveraging the publicly available MIMIC-IV-ECHO dataset, we employ an advanced masked auto-encoding strategy with an 85% mask ratio to pre-train our echo-video encoder without requiring manual annotations. To further improve the learned representations, we introduce the Spatial-Temporal Fusion Network (STF-Net), which integrates spatial and temporal correlations from the learned video representations through dual pathways that process joint and disjoint space-time features. Echo-Vision-FM performed strongly in heart function diagnosis, achieving an accuracy of 0.905, an F1 score of 0.941, and an Area Under the Curve (AUC) of 0.931 on the EchoNet-Dynamic dataset, while reaching an AUC of 0.849 for aortic stenosis diagnosis on the TMED dataset. In cardiac morphological value estimation, Echo-Vision-FM outperformed state-of-the-art models, achieving a mean absolute error of 3.87% and an r² value of 0.825 in left ventricular ejection fraction (LVEF) prediction on the EchoNet-Dynamic dataset. The model also showed substantial improvements in estimating end-systolic and end-diastolic volumes, with r² values of 0.782 and 0.742, respectively. On the CAMUS dataset, our end-to-end approach achieved a Pearson correlation coefficient of 86.49% for LVEF estimation, significantly outperforming traditional segmentation-based methods while eliminating the need for intermediate post-processing steps. Incorporating STF-Net resulted in further performance improvements across all tasks, with consistent gains observed when combined with both randomly initialized and pre-trained encoders.
Echo-Vision-FM provides a powerful, scalable approach to echocardiogram analysis, with significant potential for clinical diagnostics and research, demonstrating robust cross-institutional generalizability and data efficiency in low-resource settings.
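
To make the 85% mask ratio mentioned in the abstract concrete, here is a minimal sketch of how a masked auto-encoder might choose which space-time tokens to hide. It assumes uniform random masking over a flattened grid of video patches; the abstract does not specify the masking scheme, so the actual strategy (for example, tube masking) may differ. The function name `random_patch_mask`, the grid dimensions, and the seed are hypothetical and chosen only for illustration.

```python
import random


def random_patch_mask(num_frames, grid_h, grid_w, mask_ratio=0.85, seed=0):
    """Split a flattened space-time token grid into visible and masked indices.

    Assumption: tokens are masked uniformly at random, which may not match
    the masking scheme used by the paper.
    """
    total = num_frames * grid_h * grid_w
    num_masked = round(total * mask_ratio)
    rng = random.Random(seed)          # local generator, so results are repeatable
    order = list(range(total))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical token grid: 8 temporal slices, each with 14 x 14 patches.
visible, masked = random_patch_mask(num_frames=8, grid_h=14, grid_w=14)
print("visible tokens:", len(visible))
print("masked tokens:", len(masked))
```

In a typical masked auto-encoder, only the visible tokens are passed to the encoder and a lightweight decoder reconstructs the masked ones, which is why a high mask ratio keeps pre-training on unlabeled video comparatively cheap.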
