What question did this study set out to answer?

To develop a video-level surgical pre-training framework that improves understanding of surgical contexts.

February 7, 2026 | Open Access

Large-scale self-supervised video foundation model for intelligent surgery

Key Points

Constructed a large-scale surgical video dataset with 3650 videos and 3.55 million frames.
Introduced SurgVISTA for joint spatial and temporal representation learning.
Implemented image-level knowledge distillation using an expert model.
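SurgVISTA consistently outperforms natural- and surgical-domain pre-trained models on a benchmark of 13 video-level datasets spanning six surgical procedures and four tasks.
Joint spatiotemporal understanding targets surgical scene understanding, which supports decision-making and procedural efficacy in surgery.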
Enhanced learning of fine-grained anatomical and semantic features.

Abstract

Computer-Assisted Intervention has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making and improving procedural efficacy. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3650 videos and 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that jointly captures intricate spatial structures and temporal dynamics. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert model to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, demonstrating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.
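
The abstract describes two training signals: reconstruction of video content and image-level distillation from an expert model. The sketch below shows one generic way such a combined objective could be computed on toy data. It is not the authors' implementation. The masking scheme, the use of mean squared error for both terms, the function names, and the `distill_weight` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sample_mask(num_tokens, mask_ratio, rng):
    """Randomly choose which spatiotemporal tokens to hide (True = masked).

    Assumption: uniform random masking; the paper may use another scheme.
    """
    num_masked = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def reconstruction_loss(pred_tokens, target_tokens, mask):
    """Average MSE over masked tokens only; visible tokens are ignored."""
    losses = [mse(p, t) for p, t, m in zip(pred_tokens, target_tokens, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


def distillation_loss(student_frame_feats, teacher_frame_feats):
    """Average MSE between per-frame student features and the features
    of a frozen image-level teacher (hypothetical stand-in for the
    surgery-specific expert model)."""
    losses = [mse(s, t) for s, t in zip(student_frame_feats, teacher_frame_feats)]
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(pred_tokens, target_tokens, mask,
               student_feats, teacher_feats, distill_weight=1.0):
    """Weighted sum of the two terms; distill_weight is a hypothetical knob."""
    rec = reconstruction_loss(pred_tokens, target_tokens, mask)
    dis = distillation_loss(student_feats, teacher_feats)
    return rec + distill_weight * dis


# Toy example: 4 tokens of dimension 2, and 1 frame-level feature vector.
mask = [True, False, True, False]
target = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pred = [[1.0, 2.0], [0.0, 0.0], [5.0, 8.0], [0.0, 0.0]]
student = [[0.0, 0.0]]
teacher = [[1.0, 1.0]]
loss = total_loss(pred, target, mask, student, teacher, distill_weight=0.5)
```

In this toy setup only the masked tokens contribute to the reconstruction term, so errors on visible tokens do not affect the loss, and the distillation term pulls each frame's student feature toward the teacher's feature for the same frame.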
