Modern surgeries are complex and cognitively demanding, creating a need for advanced tools that assist medical staff, reduce cognitive load, and ultimately improve patient outcomes. Computational models with a holistic understanding of the surgical scene, its interactions, and its context hold promise for supporting surgeons during procedures, especially when they include an egocentric perspective. With advances in learning-based machine perception from images, creating these models is within reach, provided that corresponding data can be acquired. In this study, we explore the creation and processing of egocentric surgical video data collected using a head-worn recording device, namely Meta's Project Aria glasses. Along with addressing challenges in data processing, we investigate the performance of image annotation pipelines for establishing high-quality labels. To showcase the tasks such a dataset enables, we then evaluate state-of-the-art models for segmentation and for 3D human hand and body pose estimation. Our results highlight the complexities of working in a real clinical environment and provide insights for future improvements in the curation of egocentric datasets of surgical activity.
Yavuz et al. (Fri,) studied this question.