What question did this study set out to answer?

The aim is to develop a lightweight approach for accurate metric depth estimation from streaming video using relative depth models.

April 11, 2026

Metric-FlashDepth: Extending Streaming Video Relative Depth Estimation Model to Metric Depth via Scale and Shift Prediction

Key Points

Extended the FlashDepth model to monocular metric depth estimation.
Introduced a temporal scale/shift predictor using multi-layer CLS token fusion from DINOv2.
Employed a canonical space transformation module for varying camera intrinsics.
Utilized an attention-weighted loss function for improved model stability.
Achieved metric depth estimation with only a 2-3 FPS reduction relative to FlashDepth.
Outperformed Video Depth Anything while retaining accuracy comparable to image-based models.
Targets downstream applications that need actual distances, such as autonomous driving, AR/VR, and robotics.

Abstract

This study presents Metric-FlashDepth (MF Depth), a lightweight extension of FlashDepth that enables monocular metric depth estimation while preserving real-time performance on streaming video. FlashDepth achieves real-time depth estimation for 2K-resolution streaming video, but it cannot estimate the actual distances required by downstream applications such as autonomous driving, AR/VR, and robotics. The key insight is that relative depth models implicitly learn scale- and shift-invariant representations, which can be converted to metric depth given appropriate scale and shift parameters. The paper proposes a temporal scale/shift predictor that fuses CLS tokens from multiple layers of DINOv2 (ViT) to predict scale and shift values conditioned on temporal context from previous frames, adding only 1.25M parameters (0.4% of MF Depth). To handle camera intrinsics that vary across datasets, the model employs a canonical space transformation module adapted for MF Depth. An attention-weighted loss function further stabilizes performance across model configurations. The proposed approach estimates metric depth with only a 2-3 FPS degradation relative to FlashDepth. Experiments on multiple datasets show that the method achieves state-of-the-art real-time video metric depth estimation, outperforming Video Depth Anything while remaining competitive in accuracy with image-based metric models. This work bridges the gap between fast relative depth estimation and accurate metric depth prediction, broadening the potential for real-time applications across industries.
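
The scale-and-shift idea in the abstract can be illustrated with a minimal sketch. It is not the paper's predictor, which is a learned network conditioned on DINOv2 CLS tokens and previous frames. Here the two parameters are instead recovered by ordinary least squares from a few pixels with known metric depth, only to show what "converting relative depth to metric depth" means numerically. The function names `apply_scale_shift` and `fit_scale_shift` are hypothetical, depth maps are assumed to be flattened into plain lists, and the sketch assumes the affine map is applied directly to the model output (whether that output is depth or inverse depth is not stated here).

```python
def apply_scale_shift(relative, scale, shift):
    """Map relative depth values to metric depth with one affine transform."""
    return [scale * r + shift for r in relative]


def fit_scale_shift(relative, metric):
    """Closed-form least-squares fit of (scale, shift) so that
    scale * relative + shift approximates metric.

    Illustrative stand-in for a learned predictor: it needs reference
    metric depths, which a deployed model would not have at test time.
    """
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need two or more paired samples of equal length")

    mean_r = sum(relative) / n
    mean_m = sum(metric) / n

    # Variance of the relative depths; zero means the fit is undetermined.
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths are constant; scale is undefined")

    # Covariance between relative and metric depths.
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))

    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Usage: synthetic relative depths and the metric depths they correspond to.
relative_depth = [0.1, 0.5, 0.9, 0.3]
metric_depth = [2.0 * r + 0.5 for r in relative_depth]  # metres, made-up values

scale, shift = fit_scale_shift(relative_depth, metric_depth)
recovered = apply_scale_shift(relative_depth, scale, shift)
```

Because only two scalars per frame are needed, a small predictor head can supply them without touching the relative depth backbone, which is consistent with the small parameter and FPS overhead reported in the abstract.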
