What question did this study set out to answer?

This research aims to enhance pose estimation accuracy in monocular visual odometry by leveraging spatio-temporal correlations.

May 14, 2026 · Open Access

Self-Supervised Sequence Learning Framework for Monocular Visual Odometry with Spatio-Temporal Correlation

Key Points

Developed CoSTVO, a self-supervised monocular visual odometry framework.
Implemented a Cross-Connected Parallel Spatio-Temporal Extraction component.
Conducted experiments on KITTI, Malaga, and nuScenes datasets.
Achieved up to a 16.4% improvement in average translational performance over state-of-the-art self-supervised methods.
Attained up to a 38.0% improvement in average rotational performance over the same methods.

Abstract

Visual odometry (VO) estimates the ego-motion of a moving object from consecutive images captured by one or more on-board cameras. Most existing VO methods estimate the pose from only a pair of images and cannot model continuous image sequences. Our method, CoSTVO, targets self-supervised monocular VO in outdoor autonomous driving scenes, with particular attention to extracting spatial and temporal correlations simultaneously. CoSTVO introduces a Cross-Connected Parallel Spatio-Temporal Extraction (CPST) component, whose key idea is to express the extraction of tightly coupled spatio-temporal features for pose estimation as an encoder structure. CPST stores geometric structure and motion memory separately, then fully integrates them to strengthen the coupling between the spatial features of the upper layer and the short-term memory of the previous time step. In addition, CoSTVO extracts features directly from continuous image sequences so that pixel movements are handled losslessly, yielding spatio-temporally consistent motion cues for a complete feature representation. Experiments are conducted on the KITTI, Malaga, and nuScenes datasets. Quantitative results show that our method outperforms state-of-the-art self-supervised methods in average translational and rotational performance by up to 16.4% and 38.0%, respectively. In summary, this research proposes a self-supervised monocular VO method that tightly couples spatio-temporal motion feature extraction through a cross-connected parallel encoder structure to improve pose accuracy.
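To make the pose-estimation task concrete, the sketch below shows how frame-to-frame motion estimates from any VO system can be chained into a trajectory. It is not CoSTVO's code and says nothing about the CPST encoder. It assumes each relative pose is a 4x4 homogeneous transform, restricts rotation to a single axis for brevity, and uses hypothetical helper names (`make_pose`, `compose`, `accumulate`) and made-up motion values.

```python
import math

def make_pose(yaw, tx, ty, tz=0.0):
    """Build a 4x4 homogeneous transform: rotation by `yaw` (radians)
    about the z axis, plus a translation (tx, ty, tz)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def compose(a, b):
    """Matrix product a @ b for two 4x4 transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def accumulate(relative_poses):
    """Chain relative poses into global poses:
    T_world_k = T_world_(k-1) @ T_(k-1)_k, starting from the identity."""
    trajectory = [make_pose(0.0, 0.0, 0.0)]
    for rel in relative_poses:
        trajectory.append(compose(trajectory[-1], rel))
    return trajectory

def position(pose):
    """Return the translation part (x, y, z) of a 4x4 pose."""
    return (pose[0][3], pose[1][3], pose[2][3])

# Hypothetical estimates: each step moves 1 m along the current x axis
# and ends rotated 90 degrees counter-clockwise, tracing a unit square.
relative_poses = [make_pose(math.pi / 2, 1.0, 0.0) for _ in range(4)]
trajectory = accumulate(relative_poses)
final_position = position(trajectory[-1])
```

Because every global pose is a product of all earlier relative poses, a small error in one frame-to-frame estimate propagates to every later pose, which is why per-step translational and rotational accuracy matters for the whole trajectory.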


Cite This Study

Liu et al. (Fri,) studied this question.

synapsesocial.com/papers/6a05685ca550a87e60a20f3a https://doi.org/10.1016/j.cjme.2026.100321
