Depth estimation algorithms are widely applied in fields such as 3D reconstruction, autonomous driving, and industrial robotics. Monocular self-supervised depth prediction offers a cost-effective alternative to acquiring depth through hardware devices such as LiDAR. However, current depth prediction networks, which are predominantly based on conventional encoder–decoder architectures, often suffer from two critical limitations: insufficient feature fusion during the upsampling phase and constrained receptive fields. These limitations cause the loss of high-frequency details in the predicted depth maps. To overcome these issues, we introduce differential attention operators that enhance global feature representation and refine locally upsampled features within the depth decoder. Furthermore, we equip the decoder with a deformable bin-structured prediction head; this lightweight design enables per-pixel dynamic aggregation of local depth distributions via adaptive receptive field modulation and deformable sampling, improving the decoder’s handling of fine-grained detail by capturing both local geometry and holistic structures. Experimental results on the KITTI and Make3D datasets demonstrate that our proposed method produces more accurate depth maps with finer details than existing approaches.
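The abstract does not spell out how the bin-structured head turns a per-pixel depth distribution into a depth value. As a rough illustration only, the sketch below shows one common generic formulation: depth taken as the expectation over a set of bin centers, weighted by per-pixel softmax probabilities. The uniform bins, the default depth range, and all function names are assumptions made for this example, and the deformable sampling and adaptive receptive field modulation described above are not modeled.

```python
import math


def bin_centers(d_min, d_max, num_bins):
    """Centers of num_bins uniform depth bins on [d_min, d_max] (assumed layout)."""
    width = (d_max - d_min) / num_bins
    return [d_min + width * (k + 0.5) for k in range(num_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of per-bin logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(logits, centers):
    """Depth for one pixel: probability-weighted sum of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


def depth_map_from_logits(logit_map, d_min=0.1, d_max=80.0):
    """Convert an H x W grid of per-bin logit lists into an H x W depth map.

    d_min and d_max are hypothetical defaults, not values from the paper.
    """
    num_bins = len(logit_map[0][0])
    centers = bin_centers(d_min, d_max, num_bins)
    return [[expected_depth(px, centers) for px in row] for row in logit_map]
```

Under this formulation a pixel with a flat distribution maps to the middle of the depth range, while a sharply peaked distribution maps to the center of the dominant bin, which is why the quality of the per-pixel distribution governs how well fine detail is preserved.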
Zhou et al. (Fri,) studied this question.