Abstract

Purpose: Monocular depth and pose estimation play an essential role in developing colonoscopy-assisted navigation systems. Accurate geometric understanding can reduce blind spots, lower the risk of missed or recurrent lesions, and prevent incomplete examinations. However, this task remains challenging due to texture-less surfaces, complex illumination, tissue deformation, and the scarcity of in-vivo datasets with reliable ground truth.

Methods: We propose a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. The framework integrates two complementary cues: (1) edge maps, obtained from a learning-based detector trained to capture thin, high-frequency mucosal boundaries, and (2) luminance decomposition, produced through intrinsic image separation that isolates shading from reflectance. These modalities provide structural and photometric guidance to the pose and depth networks, with an edge-guided loss applied in a stage-wise refinement that enhances motion alignment while preserving depth consistency.

Results: Experiments on phantom (C3VD) and real (EndoMapper) datasets demonstrate state-of-the-art performance in depth estimation and competitive accuracy in pose estimation. Ablation analyses further evaluate the influence of training domain, temporal sampling, and supervision type. Two practical findings emerge: first, self-supervised training on real data outperforms supervised training on phantom data, highlighting the importance of domain realism; second, dataset-specific frame-rate sampling is critical for generating effective training sequences.

Conclusion: The proposed framework enhances geometric learning in endoscopic videos by incorporating structure- and illumination-aware cues, providing a robust foundation for reliable, marker-free colonoscopy navigation. The code and pretrained models are publicly available at https://github.com/XinweiJu/PRISM.
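To make the frame-rate sampling finding concrete, the following is a minimal sketch of how training triplets might be drawn from a video at a dataset-specific temporal stride. It is not the released implementation: the function names `make_triplets` and `stride_for_fps`, the (previous, target, next) triplet layout, and the frame rates in the usage example are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def make_triplets(num_frames: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (previous, target, next) frame indices spaced `stride` frames apart.

    Hypothetical helper: a larger stride gives a wider baseline between views,
    a smaller stride gives more overlap between them.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    # Only targets with a valid neighbour on both sides are kept.
    return [(t - stride, t, t + stride) for t in range(stride, num_frames - stride)]


def stride_for_fps(native_fps: float, target_fps: float) -> int:
    """Pick an integer stride that approximates `target_fps` from `native_fps`.

    Hypothetical helper: the stride never drops below 1, so a video is never
    upsampled.
    """
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, round(native_fps / target_fps))


if __name__ == "__main__":
    # Illustrative numbers only, not the frame rates of C3VD or EndoMapper.
    stride = stride_for_fps(native_fps=30.0, target_fps=10.0)
    print(stride)
    print(make_triplets(num_frames=8, stride=stride))
```

Under these assumptions, each dataset would get its own stride so that the apparent camera motion between the frames of a triplet is comparable across datasets, which is one plausible reading of why the sampling rate affects training.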
Ju et al. (Sun,) studied this question.