January 21, 2026 | Open Access

Towards vision-language models for cardiac surgery lecture videos

Key Points

The research aims to apply vision-language models to cardiac surgery lecture videos for improved comprehension.
Curated a dataset of cardiac surgery lecture videos.
Used a large language model to extract procedural steps from videos.
Augmented the training dataset with the extracted information.
Preliminary results suggest that this data augmentation can improve model performance on text-based video retrieval tasks.
Preliminary findings indicate enhanced understanding of surgical scenes.

Abstract

Vision-language models represent an emerging paradigm that leverages natural language to train vision systems with broad capabilities. Recently, the use of surgical lecture videos has emerged as a promising method for developing models capable of understanding surgical scenes. In this work, we aim to translate these developments to the domain of cardiac surgery, which is marked by heterogeneity and complexity of surgical cases. To this end, we curate a dataset of cardiac surgery lecture videos and augment the training dataset by using a Large Language Model (LLM) to extract procedural steps for each surgery. Preliminary results suggest that this form of data augmentation can enhance model performance on text-based video retrieval tasks.
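
The abstract does not say how text-based video retrieval is scored. As an illustration only, the sketch below shows one common way such a task can be evaluated: rank all video embeddings by cosine similarity to each text query and report Recall@K. The metric choice, the function names (`cosine`, `recall_at_k`), and the toy two-dimensional embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(text_embs, video_embs, k):
    """Fraction of text queries whose paired video (same index) ranks in the top k.

    Assumes text_embs[i] describes video_embs[i]; this pairing is a
    convention chosen for the example.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, video) for video in video_embs]
        # Video indices sorted from most to least similar to this query.
        ranked = sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (hypothetical values, not real model outputs).
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_embs = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.7]]

print(recall_at_k(text_embs, video_embs, k=1))
print(recall_at_k(text_embs, video_embs, k=2))
```

In a setup like this, comparing Recall@K for a model trained with and without the LLM-extracted procedural steps would be one way to quantify the effect of the augmentation.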
