Video Transformers (VidTs) have reached the forefront of accuracy in various video understanding tasks. Despite these achievements, processing a large number of video frames remains a significant performance bottleneck that impedes deployment on resource-constrained platforms. Accelerators designed specifically for Vision Transformers (ViTs) have emerged, but they may not be the optimal solution for VidTs, for two main reasons. First, these accelerators tend to overlook the inherent temporal redundancy that characterizes VidTs, which limits the room for further performance gains. Second, incorporating a sparse attention prediction module within these accelerators incurs considerable overhead.
Song et al. studied this question.
www.synapsesocial.com/papers/68e6e1dcb6db64358765d60e — DOI: https://doi.org/10.1145/3620665.3640393
Zhuoran Song
Chunyu Qi
Fangxin Liu
Shanghai Jiao Tong University