This study presents Metric-FlashDepth (MF Depth), a lightweight extension of FlashDepth that adds monocular metric depth estimation while preserving real-time performance on streaming video. FlashDepth achieves real-time depth estimation on 2K-resolution streaming video, but it predicts only relative depth and therefore cannot recover the actual distances required by downstream applications such as autonomous driving, AR/VR, and robotics. The key insight is that relative depth models implicitly learn scale- and shift-invariant representations, which can be converted to metric depth by applying appropriate scale and shift parameters. This work proposes a temporal scale/shift predictor that fuses CLS tokens from multiple layers of DINOv2 (ViT) and predicts scale and shift values conditioned on temporal context from previous frames, adding only 1.25M parameters (0.4% of MF Depth). To handle camera intrinsics that vary across datasets, the model employs a canonical space transformation module adapted for MF Depth. An attention-weighted loss function further stabilizes performance across model configurations. The proposed approach estimates metric depth at a cost of only 2-3 FPS relative to FlashDepth. Experiments on multiple datasets show that the method achieves state-of-the-art real-time video metric depth estimation, outperforming Video Depth Anything while remaining competitive in accuracy with image-based metric models. This work bridges the gap between fast relative depth estimation and accurate metric depth prediction, broadening the potential for real-time applications across industries.
Yoon et al. studied this question.
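The scale-and-shift conversion and the canonical space step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the positive clamp `min_depth`, the default `canonical_focal_px`, and the focal-length-ratio convention are all assumptions made for illustration. The abstract also does not say whether the affine map is applied in depth or inverse-depth space, so the sketch simply applies it to the relative prediction as given.

```python
from typing import List

DepthMap = List[List[float]]  # row-major H x W grid of per-pixel values


def to_metric_depth(relative: DepthMap, scale: float, shift: float,
                    min_depth: float = 1e-3) -> DepthMap:
    """Apply metric = scale * relative + shift per pixel.

    `scale` and `shift` stand in for the per-frame outputs of a
    scale/shift predictor. The clamp to `min_depth` is an assumption
    added here to keep the result strictly positive.
    """
    return [[max(scale * r + shift, min_depth) for r in row]
            for row in relative]


def from_canonical(depth_canonical: DepthMap, focal_px: float,
                   canonical_focal_px: float = 1000.0) -> DepthMap:
    """Map depth predicted for a canonical camera back to the real camera.

    Assumed convention: depth scales with the ratio of the real focal
    length to the canonical one. The canonical value is hypothetical.
    """
    ratio = focal_px / canonical_focal_px
    return [[d * ratio for d in row] for row in depth_canonical]


if __name__ == "__main__":
    relative = [[0.25, 0.5], [0.75, 1.0]]
    canonical = to_metric_depth(relative, scale=8.0, shift=0.5)
    metric = from_canonical(canonical, focal_px=500.0)
    print(metric)
```

Because both steps are per-pixel affine operations, their cost is negligible next to the backbone, which is consistent with the small frame-rate overhead the abstract reports.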