ABSTRACT
Monocular depth estimation is a fundamental task in computer vision that aims to infer scene depth from a single RGB image. With the rapid progress of deep learning, self‐supervised learning has become a prominent paradigm by exploiting photometric reconstruction and geometric constraints without requiring ground‐truth depth annotations. This paper provides a comprehensive review of self‐supervised monocular depth estimation methods with a clear methodological taxonomy. Specifically, existing approaches are systematically categorized according to their supervision construction strategies into stereo image‐pair‐based and monocular video‐based paradigms, and their core assumptions, learning pipelines, and applicability are analysed in detail. Furthermore, this survey organizes and discusses optimization strategies that improve robustness and generalization, including masking‐based optimization for dynamic scenes, multi‐modal fusion with auxiliary cues, adversarial learning‐based depth refinement, and lightweight model design for efficiency‐constrained scenarios. Finally, the paper summarizes key technical challenges faced by current self‐supervised monocular depth estimation methods and outlines potential future research directions towards more robust and practical depth estimation systems.
Xie et al. (Thu,) studied this question.