Deep learning-based monocular depth estimation has advanced significantly on urban benchmarks, but its use in embedded applications remains limited by efficiency constraints. Vision Transformers (ViTs) and Foundation Models (FMs) show promising zero-shot generalization, yet adapting them to resource-constrained hardware requires careful study. In this work, we investigate the deployment of the DepthAnything model on an NVIDIA Jetson Orin, analyzing the trade-off between accuracy and inference speed for three backbones (ViT-S, ViT-B, and ViT-L). We report quantitative metrics (AbsRel, δ1, RMSE, and FPS) on the KITTI dataset, along with qualitative results. Our experiments show that the ViT-S backbone offers the best balance of accuracy and real-time performance (44 FPS), whereas ViT-B suffers from degradation and ViT-L exhibits significant instability due to optimization artifacts. These findings highlight the viability of compact backbones for embedded visual perception and point to future optimizations, such as quantization-aware training and pruning, for larger architectures.
Veramendi et al. studied this question.
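For readers unfamiliar with the reported metrics, the following is a minimal sketch of how AbsRel, RMSE, δ1, and FPS are commonly computed. It is not the authors' evaluation code. The function names `depth_metrics` and `measure_fps`, the validity range of 0.001 to 80 m, and the warm-up and iteration counts are assumptions made for illustration. The sketch uses only the standard library and flat lists of depths; a real pipeline would typically use array operations on full depth maps.

```python
import math
import time


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute AbsRel, RMSE, and delta_1 over valid ground-truth pixels.

    pred, gt: flat sequences of depths in metres, same length.
    min_depth / max_depth: assumed validity range (not taken from the source).
    """
    # Keep only pixels whose ground truth lies inside the assumed valid range;
    # clamp predictions from below so the ratio g / p is always defined.
    pairs = [
        (max(p, min_depth), g)
        for p, g in zip(pred, gt)
        if min_depth < g < max_depth
    ]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)

    # AbsRel: mean of |pred - gt| / gt.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # RMSE: square root of the mean squared error, in metres.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta_1: fraction of pixels with max(pred/gt, gt/pred) < 1.25.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n

    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


def measure_fps(infer, warmup=10, iters=100):
    """Estimate frames per second of a zero-argument callable `infer`.

    On a GPU, `infer` should block until the result is ready (for example by
    synchronizing the device); otherwise only the launch time is measured.
    """
    for _ in range(warmup):  # warm-up runs are excluded from timing
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    return iters / elapsed if elapsed > 0 else float("inf")
```

As a usage example, `depth_metrics([10.0, 30.0, 5.0], [10.0, 20.0, 0.0])` ignores the third pixel because its ground truth is outside the valid range and evaluates the remaining two. Lower AbsRel and RMSE are better, while higher δ1 and FPS are better, which is why the accuracy and speed trade-off across backbones has to be read across all four numbers together.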