ABSTRACT In robotic surgery, surgical gesture recognition is important for surgical quality evaluation and intelligent surgical assistance. Current approaches mainly use deep learning models, such as recurrent neural networks and temporal convolutional networks, to model action sequences and capture the temporal dependencies within them. However, some of these methods do not fuse spatial and temporal features, and therefore cannot effectively capture long‐term relationships or efficiently model action sequences. To overcome these limitations, we propose a spatiotemporal adaptive network (STANet) that fuses spatiotemporal features. Specifically, we design a temporal module and a spatial module to extract the respective features. These features are then fused and further refined through temporal modeling with a temporal adaptive convolution strategy, which integrates both the long‐term and short‐term characteristics of surgical gesture sequences. The combined temporal and spatial modules are inserted into the backbone network to form STANet, which models action sequences efficiently. We validated our approach on the publicly available surgical gesture datasets JIGSAWS and RARP‐45, where it performs strongly relative to other reported benchmark models. It can be used in surgical robots, visual feedback systems, and computer‐assisted surgery.
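The data flow described in the abstract (separate spatial and temporal feature extraction, fusion, then a temporal adaptive convolution) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the modules' internals, so every function name here (`spatial_branch`, `temporal_branch`, `fuse`, `adaptive_temporal_conv`, `stanet_block`), the additive fusion, and the toy kernel generator are assumptions chosen only to show how whole-sequence statistics (long-term context) can shape a kernel that is applied over a local window (short-term context).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_branch(frames):
    """Hypothetical spatial branch: centre each frame's feature vector.

    Works on one frame at a time, so it carries no temporal information.
    """
    out = []
    for f in frames:
        mean = sum(f) / len(f)
        out.append([v - mean for v in f])
    return out


def temporal_branch(frames):
    """Hypothetical temporal branch: difference to the previous frame."""
    out = [[0.0] * len(frames[0])]  # first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def fuse(spatial, temporal):
    """Assumed fusion rule: element-wise sum of the two branches."""
    return [[s + t for s, t in zip(sf, tf)] for sf, tf in zip(spatial, temporal)]


def adaptive_temporal_conv(frames, kernel_size=3):
    """Toy temporal adaptive convolution (kernel_size assumed odd).

    For each channel, a kernel is generated from the channel's global
    average over the whole sequence, then applied over a local window
    with zero padding at the sequence boundaries.
    """
    T, C = len(frames), len(frames[0])
    pad = kernel_size // 2
    out = [[0.0] * C for _ in range(T)]
    for c in range(C):
        series = [frames[t][c] for t in range(T)]
        g = sum(series) / T  # global (long-term) descriptor
        # Assumed kernel generator: logits scale with the global descriptor.
        kernel = softmax([g * (k - pad) for k in range(kernel_size)])
        for t in range(T):
            acc = 0.0
            for k in range(kernel_size):
                idx = t + k - pad
                if 0 <= idx < T:
                    acc += kernel[k] * series[idx]
            out[t][c] = acc
    return out


def stanet_block(frames):
    """Extract both feature types, fuse them, then refine temporally."""
    fused = fuse(spatial_branch(frames), temporal_branch(frames))
    return adaptive_temporal_conv(fused)
```

Here `frames` is a list of T per-frame feature vectors of equal length, and `stanet_block(frames)` returns a list of the same shape. In a real network the branches and the kernel generator would be learned layers inside the backbone; the fixed arithmetic above only stands in for them.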
Boqiang Jia
Wenjie Wang
Xin Tian
Annals of the New York Academy of Sciences
Xi'an Polytechnic University
www.synapsesocial.com/papers/68d6c671b1249cec298b1fea — DOI: https://doi.org/10.1111/nyas.70053