Depth estimation is a key perception task for robots and autonomous systems, with self-supervised monocular approaches gaining traction due to their independence from ground-truth labels. However, these methods often exhibit unstable training due to reliance on photometric consistency. This paper proposes an efficient self-supervised depth estimation framework that improves prediction accuracy while reducing computational cost. Training stability is enhanced through knowledge distillation using pseudo-labels from a foundation model, and a lightweight attention module is introduced to strengthen global spatial representation. Despite reducing model parameters by 40% and FLOPs by 20%, experiments on the KITTI Eigen split show improved abs_rel and performance compared to the baseline.
Han et al. (Mon,) studied this question.