Key points are not available for this paper at this time.
Event cameras, offering high temporal resolutions and high dynamic ranges, have brought a new perspective to address common challenges in monocular depth estimation (e.g., motion blur and low light). However, existing CNN-based methods insufficiently exploit global spatial information from asynchronous events, while RNN-based methods show a limited capacity for effective temporal cues utilization for event-based monocular depth estimation. To this end, we propose a event-based monocular depth estimator with recurrent transformers, namely EReFormer. Technically, we first design a transformer-based encoder-decoder that utilizes multi-scale features to model global spatial information from events. Then, we propose a Gate Recurrent Vision Transformer (GRViT), introducing a recursive mechanism into transformers, to leverage rich temporal cues from events. Finally, we present a Cross Attention-guided Skip Connection (CASC), performing cross attention to fuse multi-scale features, to improve global spatial modeling capabilities. The experimental results show that our EReFormer outperforms state-of-the-art methods by a margin on both synthetic and real-world datasets. Our open-source code is available at https://github.com/liuxu0303/EReFormer.
Building similarity graph...
Analyzing shared references across papers
Loading...
Xu Liu
Jianing Li
Jinqiao Shi
IEEE Transactions on Circuits and Systems for Video Technology
Peking University
Harbin Institute of Technology
Beijing University of Posts and Telecommunications
Building similarity graph...
Analyzing shared references across papers
Loading...
Liu et al. (Mon,) studied this question.
www.synapsesocial.com/papers/68e7376bb6db6435876b0f88 — DOI: https://doi.org/10.1109/tcsvt.2024.3378742