June 1, 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Puntos clave

Los puntos clave no están disponibles para este artículo en este momento.

Resumen

While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process 3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models will be made publicly available.

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Puntos clave

Resumen

Cite This Study