What question did this study set out to answer?

The central aim is to improve monocular depth and pose estimation for colonoscopy navigation using a self-supervised learning framework.

May 12, 2026 | Open Access

Multi-modal monocular endoscopic depth and pose estimation with edge-guided self-supervision

Key Points

The authors propose a self-supervised learning framework that leverages anatomical and illumination priors.
The framework feeds edge maps from a learning-based detector and a luminance decomposition to the pose and depth networks.
Experiments cover phantom (C3VD) and real (EndoMapper) datasets, with ablation analyses on training domain, temporal sampling, and supervision type.
The method achieves state-of-the-art depth estimation and competitive pose estimation accuracy on the evaluated datasets.
Self-supervised training on real data outperformed supervised training on phantom data, which highlights the importance of domain realism.
Dataset-specific frame-rate sampling is critical for generating effective training sequences.

Abstract

Purpose: Monocular depth and pose estimation play an essential role in developing colonoscopy-assisted navigation systems. Accurate geometric understanding can reduce blind spots, lower the risk of missed or recurrent lesions, and prevent incomplete examinations. However, this task remains challenging due to texture-less surfaces, complex illumination, tissue deformation, and the scarcity of in-vivo datasets with reliable ground truth.

Methods: We propose a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. The framework integrates two complementary cues: (1) edge maps, obtained from a learning-based detector trained to capture thin and high-frequency mucosal boundaries, and (2) luminance decomposition, produced through intrinsic image separation that isolates shading from reflectance. These modalities provide structural and photometric guidance to the pose and depth networks, with an edge-guided loss applied in a stage-wise refinement that enhances motion alignment while preserving depth consistency.

Results: Experiments on phantom (C3VD) and real (EndoMapper) datasets demonstrate state-of-the-art performance in depth estimation and competitive accuracy in pose estimation. Ablation analyses further evaluate the influence of training domain, temporal sampling, and supervision type. Two practical findings emerge: self-supervised training on real data outperforms supervised training on phantom data, highlighting the importance of domain realism; and dataset-specific frame-rate sampling is critical for generating effective training sequences.

Conclusion: The proposed framework enhances geometric learning in endoscopic videos by incorporating structure- and illumination-aware cues, providing a robust foundation for reliable, marker-free colonoscopy navigation. The code and pretrained models are publicly available at: https://github.com/XinweiJu/PRISM.
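The frame-rate sampling finding can be made concrete with a small sketch. Self-supervised depth and pose pipelines are commonly trained on short frame sequences such as (previous, target, next) triplets, and the spacing between those frames determines how much camera motion the networks see. The abstract does not describe the sampling scheme actually used, so the code below is only an illustration under assumptions: the function names `sample_triplets` and `stride_for_fps` are hypothetical, and the frame rates in the usage note are made-up values, not properties of C3VD or EndoMapper.

```python
from typing import List, Tuple


def stride_for_fps(native_fps: float, target_fps: float) -> int:
    """Pick an integer frame stride that approximates `target_fps`.

    Hypothetical helper: the stride is never smaller than 1, so a video
    is never upsampled.
    """
    return max(1, round(native_fps / target_fps))


def sample_triplets(num_frames: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (previous, target, next) frame indices spaced `stride` apart.

    Only targets with both neighbours inside the video are kept.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [(t - stride, t, t + stride)
            for t in range(stride, num_frames - stride)]
```

For example, with an assumed 30 fps recording and an assumed target of 10 fps, `stride_for_fps(30, 10)` gives a stride of 3, and `sample_triplets(7, 3)` yields the single triplet `(0, 3, 6)`. A per-dataset stride of this kind is one simple way to keep inter-frame motion comparable across videos recorded at different frame rates.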
