Precise 3D perception is critical for indoor robotics, augmented reality, and autonomous navigation. However, existing multi-frame depth estimation methods often degrade substantially in challenging indoor scenes with weak textures, non-Lambertian surfaces, and complex layouts. To address these limitations, we propose MonoPrior-Fusion (MPF), a novel framework that integrates pixel-wise monocular priors directly into the multi-view matching process. Specifically, MPF uses these priors to modulate cost-volume hypotheses, which disambiguates matches, and employs a hierarchical fusion architecture across multiple scales to propagate global and local geometric information. Additionally, we introduce a geometric consistency loss based on virtual planes to enhance global 3D coherence. Extensive experiments on ScanNetV2, 7Scenes, TUM RGB-D, and GMU Kitchens demonstrate that MPF achieves significant improvements over state-of-the-art multi-frame baselines and generalizes well to unseen domains. Furthermore, MPF yields more accurate and complete 3D reconstructions when integrated into a volumetric fusion pipeline, demonstrating its effectiveness for dense mapping tasks. The source code will be made publicly available to support reproducibility and future research.
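To make the idea of modulating cost-volume hypotheses with a monocular prior concrete, the following is a minimal NumPy sketch and not the MPF implementation. It assumes a cost volume of matching scores (higher is better) over a fixed set of depth hypotheses, and it assumes a Gaussian log-prior centered on the monocular depth with a relative tolerance `sigma`. The function names, the Gaussian form, and the `sigma` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def modulate_cost_volume(cost_volume, depth_hypotheses, mono_depth, sigma=0.1):
    """Bias matching scores toward a per-pixel monocular depth prior.

    Illustrative assumption, not the method from the paper:
    a Gaussian log-prior is added to every depth hypothesis.

    cost_volume:      (D, H, W) matching scores, higher means a better match
    depth_hypotheses: (D,) candidate depths
    mono_depth:       (H, W) monocular depth prior
    sigma:            tolerance, relative to the prior depth
    """
    prior = np.maximum(mono_depth, 1e-6)[None]  # (1, H, W), guards against zeros
    # Relative deviation of each hypothesis from the prior, shape (D, H, W).
    deviation = (depth_hypotheses[:, None, None] - prior) / (sigma * prior)
    # Adding a log-Gaussian penalizes hypotheses far from the prior.
    return cost_volume - 0.5 * deviation**2


def pick_depth(scores, depth_hypotheses):
    """Select the best-scoring depth hypothesis for every pixel."""
    return depth_hypotheses[scores.argmax(axis=0)]


if __name__ == "__main__":
    hypotheses = np.linspace(0.5, 4.0, 8)  # 0.5, 1.0, ..., 4.0
    # Ambiguous matching: two equally strong peaks, at 1.0 and at 3.0.
    scores = np.zeros((8, 2, 2))
    scores[1] = 5.0
    scores[5] = 5.0
    mono = np.full((2, 2), 2.9)  # the prior suggests a depth near 3.0
    modulated = modulate_cost_volume(scores, hypotheses, mono)
    print(pick_depth(modulated, hypotheses))
```

In this toy case the raw scores cannot distinguish between the two peaks, which mimics a weakly textured region, while the modulated scores favor the hypothesis closest to the prior. Adding a log-prior rather than multiplying the scores is a design choice of this sketch that keeps the behavior well defined when scores are negative.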
Lin et al. (Wed,) studied this question.