What question did this study set out to answer?

The aim is to enhance depth estimation for autonomous mobile robots by integrating RGB and LiDAR data through a two-stage autoencoder.

March 21, 2026 · Open Access

Distilling Apple DepthPro for RGB-LiDAR depth estimation

Key Points

Developed a two-stage autoencoder architecture for depth estimation.
Distilled Apple’s DepthPro model in the first stage to ensure structural integrity.
Incorporated LiDAR point clouds in the second stage for improved accuracy.
Tested three autoencoder variants with different multimodal fusion strategies.
Evaluated the architecture with real-world warehouse data.
Achieved improvements in depth estimation accuracy and perceptual quality.
Demonstrated robustness under varying scenes and lighting conditions.
All architecture variants showed enhanced performance with different fusion strategies.

Abstract

This work presents a two-stage autoencoder architecture that improves depth estimation for Autonomous Mobile Robot (AMR) applications by distilling Apple’s DepthPro model and integrating LiDAR data. It addresses limitations of existing depth estimation technologies in warehouse robotics, where accurate depth perception is essential for tasks such as pallet picking and placing. The architecture combines the strengths of RGB-based depth estimation with sparse but accurate LiDAR measurements.

The first stage distills the Apple DepthPro model to maintain structural integrity while producing a more efficient architecture suitable for mobile robots (ResNet18, ResNet50, MobileNetV2, Swin-T, ViT-B-16, and MobileNetV3-S). The second stage adds LiDAR point clouds, projected to image space, to the loss function. This aligns the estimated depth with real-world geometric measurements while preserving the structural integrity obtained in the first stage.

The architecture is explored in three autoencoder variants with different multimodal fusion strategies: Variant I uses three independent encoders that process RGB, depth, and segmentation data simultaneously; Variant II uses two encoders that handle bimodal pairs (RGB with depth, or RGB with segmentation); and Variant III is a single-encoder baseline that uses only RGB or depth data. Each variant is evaluated with both direct concatenation and attention-based feature fusion.

The evaluation used real-world data collected in a warehouse environment and covered various combinations of architecture variants, fusion strategies, and loss functions. The reported results show that the two-stage approach improves accuracy, perceptual quality, and robustness across varying scenes and lighting conditions.

• Two-stage depth estimation autoencoder architecture. In the first stage, the depth estimation model is distilled from Apple’s DepthPro for structural geometry integrity. The second stage refines the estimated depth with accurate metric (LiDAR) values via fine-tuning, considering depth consistency and structural geometry.
• Real-world evaluation using a dataset acquired by an AMR with onboard LiDAR and camera sensors in an industrial warehouse.
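The second-stage idea described in the abstract (LiDAR points projected to image space and used in the loss, alongside a term that keeps the first-stage structure) can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the paper: the pinhole intrinsics `fx, fy, cx, cy`, points already expressed in the camera frame, a masked L1 term on LiDAR pixels, a gradient-matching term against the first-stage output, and the weights. The function names `project_lidar`, `gradient_l1`, and `stage_two_loss` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Pixel = Tuple[int, int]        # (row, column)
DepthMap = List[List[float]]   # dense depth in metres, indexed [row][column]


def project_lidar(points_cam: Iterable[Tuple[float, float, float]],
                  fx: float, fy: float, cx: float, cy: float,
                  height: int, width: int) -> Dict[Pixel, float]:
    """Project camera-frame 3D points to a sparse depth map (assumed pinhole model)."""
    sparse: Dict[Pixel, float] = {}
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point is behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= v < height and 0 <= u < width:
            # If several points land on one pixel, keep the nearest return.
            if (v, u) not in sparse or z < sparse[(v, u)]:
                sparse[(v, u)] = z
    return sparse


def gradient_l1(a: DepthMap, b: DepthMap) -> float:
    """Mean absolute difference between forward-difference gradients of a and b."""
    h, w = len(a), len(a[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            if u + 1 < w:  # horizontal gradient
                total += abs((a[v][u + 1] - a[v][u]) - (b[v][u + 1] - b[v][u]))
                count += 1
            if v + 1 < h:  # vertical gradient
                total += abs((a[v + 1][u] - a[v][u]) - (b[v + 1][u] - b[v][u]))
                count += 1
    return total / count if count else 0.0


def stage_two_loss(pred: DepthMap, stage_one: DepthMap,
                   sparse_lidar: Dict[Pixel, float],
                   lidar_weight: float = 1.0,
                   structure_weight: float = 0.5) -> float:
    """Assumed objective: masked L1 on LiDAR pixels plus a structure-preserving term."""
    if sparse_lidar:
        lidar_term = sum(abs(pred[v][u] - z)
                         for (v, u), z in sparse_lidar.items()) / len(sparse_lidar)
    else:
        lidar_term = 0.0  # no valid LiDAR pixel in this frame
    structure_term = gradient_l1(pred, stage_one)
    return lidar_weight * lidar_term + structure_weight * structure_term
```

The LiDAR term is evaluated only where a measurement exists, so the sparse point cloud anchors the metric scale, while the gradient term discourages the fine-tuned prediction from drifting away from the edges and surfaces learned in the first stage. A real training setup would express the same idea with tensors and automatic differentiation.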
