Visual odometry (VO) estimates the ego-motion of a moving platform from consecutive images captured by one or more on-board cameras. Most existing VO methods estimate the pose from a single pair of images and cannot model continuous image sequences. Our method, CoSTVO, addresses self-supervised monocular VO for outdoor autonomous driving scenes, with particular attention to extracting spatial and temporal correlations simultaneously. CoSTVO introduces a Cross-Connected Parallel Spatio-Temporal Extraction (CPST) component, whose key idea is to formulate the extraction of tightly coupled spatio-temporal features for pose estimation as an encoder structure. CPST stores geometric structure and motion memory separately, then fully integrates them to strengthen the coupling between the spatial features of the upper layer and the short-term memory of the previous moment. In addition, CoSTVO operates directly on continuous image sequences so that pixel movements are handled losslessly, which yields spatio-temporally consistent motion cues for a complete feature representation. Experiments and evaluation are performed on the KITTI, Malaga, and nuScenes datasets. Quantitatively, our method outperforms state-of-the-art self-supervised methods by up to 16.4% and 38.0% in average translational and rotational performance, respectively. In summary, this research proposes a self-supervised monocular VO method that uses a cross-connected parallel encoder structure to tightly couple spatio-temporal motion feature extraction and thereby improve pose accuracy.
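To make the basic VO setting concrete, the following minimal sketch shows how frame-to-frame pose estimates are generally chained into a global trajectory. It is not CoSTVO or its CPST encoder, whose internals the abstract does not specify. The helper names (`planar_pose`, `matmul4`, `accumulate_trajectory`, `positions`) and the planar-motion simplification are assumptions made purely for illustration.

```python
import math

def planar_pose(yaw_rad, forward_m):
    """Hypothetical relative pose T_{k-1,k} as a 4x4 homogeneous matrix.

    Frame k sits forward_m along the x-axis of frame k-1, and its heading
    is rotated by yaw_rad about the z-axis (planar motion, for simplicity).
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c, -s, 0.0, forward_m],
        [s, c, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def accumulate_trajectory(relative_poses):
    """Chain relative poses T_{k-1,k} into global poses T_{0,k}.

    Uses T_{0,k} = T_{0,k-1} @ T_{k-1,k}; the first pose is the identity.
    """
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    trajectory = [identity]
    for rel in relative_poses:
        trajectory.append(matmul4(trajectory[-1], rel))
    return trajectory

def positions(trajectory):
    """Extract the (x, y, z) translation of each global pose."""
    return [(T[0][3], T[1][3], T[2][3]) for T in trajectory]

if __name__ == "__main__":
    # Four identical steps: move 1 m forward, then turn 90 degrees left.
    steps = [planar_pose(math.pi / 2, 1.0) for _ in range(4)]
    for x, y, z in positions(accumulate_trajectory(steps)):
        print(f"{x:+.3f} {y:+.3f} {z:+.3f}")
```

Because each global pose is a product of all earlier relative poses, an error in any single frame-to-frame estimate propagates to every later pose. This is one general reason sequence-level modeling is attractive compared with treating each image pair in isolation.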
Liu et al. (Fri,) studied this question.