What question did this study set out to answer?

The aim is to improve video action recognition using a lightweight adaptation method without degrading spatial representations.

April 19, 2026 · Open Access

VETA-CLIP: Lightweight Video Adaptation with Efficient Spatio-Temporal Attention and Variation Loss

Key Points

Proposed the VETA-CLIP framework, which incorporates lightweight adapters into a frozen backbone.
Implemented an Efficient Spatio-Temporal Attention (ESTA) mechanism that decouples spatial and temporal attention heads.
Introduced a Variation Loss to emphasize action-related changes across video frames.
Achieved competitive performance on HMDB-51, UCF-101, and Something-Something v2 datasets.
Maintained competitiveness on the Kinetics-400 dataset with only 3.55M trainable parameters.
Demonstrated computational efficiency: the eight-frame variant requires 4.7 GB of peak GPU memory and 2.47 ms of inference per video.

Abstract

Full fine-tuning of large-scale vision-language models for video action recognition incurs prohibitive computational cost and often degrades pre-trained spatial representations. To address this, we propose VETA-CLIP, a Video Efficient Temporal Adaptation framework that enhances temporal modeling while preserving cross-modal alignment. By incorporating lightweight adapters into a frozen backbone, VETA-CLIP introduces only 3.55M trainable parameters (a 98% reduction compared to full fine-tuning). Our approach features two key innovations: (1) an Efficient Spatio-Temporal Attention (ESTA) mechanism with a parameter-free boundary replication temporal shift (BRTS) module, which explicitly decouples spatial and temporal attention heads to capture inter-frame dynamics while minimizing disruption to the pre-trained spatial representations; and (2) a novel Variation Loss that maximizes both local inter-frame differences and global temporal variance, encouraging the model to focus on action-related changes rather than static backgrounds. Extensive experiments on HMDB-51, UCF-101, and Something-Something v2 demonstrate that VETA-CLIP achieves competitive performance across zero-shot, base-to-novel, and few-shot protocols, while remaining competitive on the Kinetics-400 dataset. Notably, our eight-frame variant requires only 4.7 GB of peak GPU memory and 2.47 ms of inference per video, demonstrating exceptional computational efficiency alongside consistent accuracy gains.
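
To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function names `shift_with_boundary_replication` and `variation_loss`, the use of squared differences, the equal weighting of the local and global terms, and the representation of a video as a list of per-frame feature vectors.

```python
from statistics import fmean


def shift_with_boundary_replication(frames, offset):
    """Shift per-frame features along time by `offset` steps (hypothetical helper).

    Positions vacated by the shift are filled by repeating the nearest
    boundary frame rather than zeros. No learnable parameters are involved.
    """
    last = len(frames) - 1
    # Clamp the source index so it never leaves the valid range [0, last].
    return [frames[min(max(t - offset, 0), last)] for t in range(len(frames))]


def variation_loss(frames):
    """Illustrative loss: lower when features change more over time.

    Assumed form: negative of (local term + global term), where the local
    term is the mean squared difference between consecutive frames and the
    global term is the per-dimension variance over time, averaged.
    """
    num_frames, dim = len(frames), len(frames[0])

    # Local term: mean squared difference between consecutive frames.
    local = fmean(
        (frames[t + 1][d] - frames[t][d]) ** 2
        for t in range(num_frames - 1)
        for d in range(dim)
    )

    # Global term: variance of each feature dimension across all frames.
    means = [fmean(frame[d] for frame in frames) for d in range(dim)]
    global_variance = fmean(
        (frames[t][d] - means[d]) ** 2
        for t in range(num_frames)
        for d in range(dim)
    )

    # Minimizing the negative sum maximizes both kinds of variation.
    return -(local + global_variance)


if __name__ == "__main__":
    static_clip = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
    moving_clip = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]
    print(shift_with_boundary_replication(["f0", "f1", "f2"], 1))
    print(variation_loss(static_clip), variation_loss(moving_clip))
```

Under these assumptions, a clip whose features never change yields a loss of zero, while a clip with changing features yields a negative loss, which is the behavior the abstract describes as favoring action-related changes over static backgrounds. In a real model the shift would likely apply to a subset of channels or attention heads, and the loss would be combined with the recognition objective; neither detail is specified in the abstract.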
