What question did this study set out to answer?

This study asks how to overcome two limitations of monocular VSLAM systems: scale ambiguity and unreliable depth cues in textureless environments.

June 8, 2026 | Open Access

ViMGS-SLAM: A real-time monocular 3DGS-based SLAM via multiscale vision transformers

Key Points

Developed the ViMGS-SLAM framework, which integrates a Multi-scale Vision Transformer (MViT) with 3D Gaussian Splatting (3DGS).
Implemented a synchronous tracking-mapping pipeline with adaptive keyframe selection and anisotropic regularization.
Evaluated tracking accuracy on the TUM RGB-D fr3 dataset and novel-view synthesis on Replica.
Reduced absolute trajectory error by 46.0% relative to MonoGS (from 0.0437 m to 0.0236 m).
Achieved a PSNR of 39.6 dB, an SSIM of 0.976, and an LPIPS of 0.042 for novel-view synthesis, outperforming both NeRF-based and 3DGS-based methods.
Operated at 2.7 FPS end-to-end in monocular mode, while the differentiable renderer alone reaches 1130 FPS.
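
As a quick sanity check on the figures above, the short Python sketch below recomputes the relative ATE reduction from the two reported errors and converts the reported frame rates into per-frame times. The helper names `relative_reduction` and `frame_time_ms` are illustrative and do not come from the study.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


def frame_time_ms(fps: float) -> float:
    """Average time per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Values reported for TUM RGB-D fr3: MonoGS 0.0437 m, ViMGS-SLAM 0.0236 m.
ate_drop = relative_reduction(0.0437, 0.0236)
print(f"ATE reduction: {ate_drop:.1f}%")  # prints "ATE reduction: 46.0%"

# Reported throughput: 2.7 FPS end-to-end, 1130 FPS for the renderer alone.
print(f"End-to-end: {frame_time_ms(2.7):.0f} ms per frame")
print(f"Renderer only: {frame_time_ms(1130):.2f} ms per frame")
```

The gap between the two frame times suggests that rendering accounts for only a small share of the end-to-end cost per frame, with the remainder presumably spent on depth inference, tracking, and map optimization.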

Abstract

Visual Simultaneous Localization and Mapping (VSLAM) is essential for autonomous systems, yet monocular implementations struggle with scale ambiguity and unreliable depth cues in textureless regions. We propose ViMGS-SLAM, a novel monocular framework that integrates a Multi-scale Vision Transformer (MViT) with 3D Gaussian Splatting (3DGS) to achieve real-time, metric-scale dense reconstruction. The MViT generates robust depth priors through a hierarchical pyramid architecture (three input scales, five feature levels), which are then used to initialize and constrain an explicit 3D Gaussian scene representation. A synchronous tracking–mapping pipeline with adaptive keyframe selection and anisotropic regularization jointly optimizes camera poses and Gaussian parameters. On the TUM RGB-D fr3 dataset, ViMGS-SLAM reduces absolute trajectory error by 46.0% (from 0.0437 m to 0.0236 m) compared to MonoGS. On Replica, it achieves state-of-the-art novel-view synthesis with PSNR 39.6 dB, SSIM 0.976, and LPIPS 0.042, outperforming both NeRF-based and 3DGS-based methods. The system operates at 2.7 FPS end-to-end in monocular mode, while the differentiable renderer alone reaches 1130 FPS, confirming its efficiency. Ablation studies validate the contributions of the MViT depth prior, the adaptive keyframing, and the regularization terms.
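
For readers unfamiliar with the two headline metrics, here is a minimal Python sketch of how absolute trajectory error (as an RMSE over positions) and PSNR are commonly computed. It is not the authors' evaluation code. It assumes the estimated and ground-truth positions are already time-associated and aligned (monocular evaluations typically need a similarity alignment first), and that pixel values are normalized to [0, 1]. The function names and the toy trajectory are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of translational error over paired, pre-aligned 3D positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_for_psnr(psnr_db, max_value=1.0):
    """Inverse of psnr(): the mean squared error implied by a PSNR value."""
    return max_value ** 2 / 10.0 ** (psnr_db / 10.0)


# Toy trajectory (hypothetical values, in meters).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.03, 0.0), (1.0, 0.0, 0.04), (2.0, 0.0, 0.0)]
print(f"ATE RMSE: {ate_rmse(est, gt):.4f} m")

# For a single image, a PSNR of 39.6 dB corresponds to an MSE near 1.1e-4.
print(f"MSE at 39.6 dB: {mse_for_psnr(39.6):.2e}")
```

Note that a PSNR averaged over many rendered views is not the same as the PSNR of the averaged MSE, so the last line is only a per-image illustration of scale.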
