Abstract

Vision-language models represent an emerging paradigm that leverages natural language to train vision systems with broad capabilities. Recently, the use of surgical lecture videos has emerged as a promising method for developing models capable of understanding surgical scenes. In this work, we aim to translate these developments to the domain of cardiac surgery, which is marked by heterogeneity and complexity of surgical cases. To this end, we curate a dataset of cardiac surgery lecture videos and augment the training dataset by using a Large Language Model (LLM) to extract procedural steps for each surgery. Preliminary results suggest that this form of data augmentation can enhance model performance on text-based video retrieval tasks.
Kostiuchik et al. studied this question.