Visual Simultaneous Localization and Mapping (VSLAM) is essential for autonomous systems, yet monocular implementations struggle with scale ambiguity and unreliable depth cues in textureless regions. We propose ViMGS-SLAM, a novel monocular framework that integrates a Multi-scale Vision Transformer (MViT) with 3D Gaussian Splatting (3DGS) to achieve real-time, metric-scale dense reconstruction. The MViT uses a hierarchical pyramid architecture (three input scales, five feature levels) to generate robust depth priors, which initialize and constrain an explicit 3D Gaussian scene representation. A synchronous tracking–mapping pipeline with adaptive keyframe selection and anisotropic regularization jointly optimizes camera poses and Gaussian parameters. On the TUM RGB-D fr3 dataset, ViMGS-SLAM reduces absolute trajectory error (ATE) by 46.0% (from 0.0437 m to 0.0236 m) relative to MonoGS. On Replica, it achieves state-of-the-art novel-view synthesis with a PSNR of 39.6 dB, an SSIM of 0.976, and an LPIPS of 0.042, outperforming both NeRF-based and 3DGS-based methods. The system operates at 2.7 FPS end-to-end in monocular mode, while the differentiable renderer alone reaches 1130 FPS, confirming the efficiency of the rendering stage. Ablation studies validate the contributions of the MViT depth prior, adaptive keyframe selection, and the regularization terms.
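The reported trajectory improvement can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the absolute trajectory error is summarized as the root-mean-square of per-frame translational errors over positions that are already time-associated and aligned, which is the common convention but is not stated in the abstract. The function names `ate_rmse` and `relative_reduction` and the toy trajectory are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of translational error over paired 3D positions.

    Assumes both trajectories are already time-associated and aligned;
    the alignment step itself is outside this sketch.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Figures quoted in the abstract: MonoGS baseline vs. ViMGS-SLAM, in metres.
reduction = relative_reduction(0.0437, 0.0236)
print(f"ATE reduction: {reduction:.1f}%")  # prints "ATE reduction: 46.0%"

# Toy two-frame trajectory (made-up values) to exercise ate_rmse.
toy_estimate = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
toy_error = ate_rmse(toy_estimate, toy_ground_truth)
```

The relative reduction is computed against the baseline error, so (0.0437 − 0.0236) / 0.0437 ≈ 0.460, which matches the 46.0% stated above.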
Zhu et al. (Mon,) studied this question.