Visual–Inertial Odometry (VIO) is a foundational technology for autonomous navigation and robotics. However, existing deep learning-based methods face key challenges in temporal modeling and computational efficiency. Conventional approaches, such as Long Short-Term Memory (LSTM) networks and Transformer-based methods, often struggle to capture dependencies across different temporal scales and incur high computational costs. To address these issues, this work introduces RWKV-VIO, a novel framework based on the Receptance Weighted Key Value (RWKV) architecture. The framework has a lightweight structure and linear computational complexity, which reduces the computational burden of temporal modeling. Furthermore, a newly developed Inertial Measurement Unit (IMU) encoder uses residual connections and channel alignment to improve feature extraction, allowing efficient use of historical inertial data. A parallel encoding strategy employs two independently initialized encoders that extract features from different dimensions, strengthening the model’s ability to detect complex patterns. Experimental results on public datasets show that RWKV-VIO, which prioritizes computational efficiency and lightweight design, significantly reduces model size and inference time compared to existing advanced methods while achieving top-ranked positioning accuracy among the evaluated approaches.
Yang et al. (Mon,) studied this question.
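
The linear computational complexity mentioned above can be illustrated with a minimal single-channel sketch of an RWKV-style weighted key-value (WKV) recurrence. This is an assumption-laden illustration, not the RWKV-VIO implementation: it follows the commonly described RWKV-4 formulation, omits the numerical-stability tricks used in practice, and the names `wkv_recurrent`, `wkv_naive`, `decay`, and `bonus` are hypothetical. The recurrent form keeps two running sums and processes each time step once, whereas the naive reference form revisits all past steps at every time step.

```python
import math

def wkv_recurrent(keys, values, decay, bonus):
    """Single-channel RWKV-4-style WKV: O(T) time, constant-size state.

    decay: non-negative per-step decay applied to past contributions (assumed scalar).
    bonus: extra weight given to the current time step (assumed scalar).
    """
    num = 0.0  # decayed running sum of exp(k_i) * v_i over past steps
    den = 0.0  # decayed running sum of exp(k_i) over past steps
    outputs = []
    for k, v in zip(keys, values):
        cur = math.exp(bonus + k)  # weight of the current step
        outputs.append((num + cur * v) / (den + cur))
        # Fold the current step into the state, decaying older history.
        num = math.exp(-decay) * num + math.exp(k) * v
        den = math.exp(-decay) * den + math.exp(k)
    return outputs

def wkv_naive(keys, values, decay, bonus):
    """Reference form that re-sums the whole history at every step: O(T^2)."""
    outputs = []
    for t in range(len(keys)):
        cur = math.exp(bonus + keys[t])
        num, den = cur * values[t], cur
        for i in range(t):
            w = math.exp(-(t - 1 - i) * decay + keys[i])
            num += w * values[i]
            den += w
        outputs.append(num / den)
    return outputs

# Hypothetical toy inputs, e.g. one feature channel over four time steps.
keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5]
fast = wkv_recurrent(keys, values, decay=0.5, bonus=0.2)
slow = wkv_naive(keys, values, decay=0.5, bonus=0.2)
```

Both functions compute the same outputs; only the recurrent one scales linearly with the sequence length, which is the property that makes this family of models attractive for long inertial and visual feature sequences.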