Full fine-tuning of large-scale vision-language models for video action recognition incurs prohibitive computational cost and often degrades pre-trained spatial representations. To address this, we propose VETA-CLIP, a Video Efficient Temporal Adaptation framework that enhances temporal modeling while preserving cross-modal alignment. By incorporating lightweight adapters into a frozen backbone, VETA-CLIP introduces only 3.55M trainable parameters (a 98% reduction compared to full fine-tuning). Our approach features two key innovations: (1) an Efficient Spatio-Temporal Attention (ESTA) mechanism with a parameter-free boundary-replication temporal shift (BRTS) module, which explicitly decouples spatial and temporal attention heads to capture inter-frame dynamics while minimizing disruption to the pre-trained spatial representations; and (2) a novel Variation Loss that maximizes both local inter-frame differences and global temporal variance, encouraging the model to focus on action-related changes rather than static backgrounds. Extensive experiments on HMDB-51, UCF-101, and Something-Something v2 demonstrate that VETA-CLIP achieves competitive performance across zero-shot, base-to-novel, and few-shot protocols, and it remains competitive on the Kinetics-400 dataset. Notably, our eight-frame variant requires only 4.7 GB of peak GPU memory and 2.47 ms of inference per video, demonstrating exceptional computational efficiency alongside consistent accuracy gains.
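To make the two ideas named above more concrete, the following is a minimal NumPy sketch and not the implementation used in VETA-CLIP. It assumes per-frame features of shape `(T, C)`, shifts a fraction of feature channels along time (the abstract describes the shift at the level of attention heads), and uses squared differences with equal weighting for the loss. The function names, the `fold_div` parameter, and the exact loss form are illustrative assumptions, since the abstract does not give the formulas.

```python
import numpy as np


def boundary_replicated_shift(x, fold_div=4):
    """Parameter-free temporal shift with boundary replication (illustrative).

    x: array of shape (T, C), one feature vector per frame.
    The first C // fold_div channels take their values from the previous
    frame, the next C // fold_div channels from the following frame.
    At the clip boundaries the frame keeps its own values instead of being
    zero-padded. The channel split is an assumption made for this sketch.
    """
    _, c = x.shape
    fold = c // fold_div
    out = x.copy()
    # Shift forward in time: frame t sees frame t-1; frame 0 is replicated.
    out[1:, :fold] = x[:-1, :fold]
    # Shift backward in time: frame t sees frame t+1; last frame is replicated.
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
    return out


def variation_loss(x):
    """Negative of (local inter-frame difference + global temporal variance).

    Minimizing this value maximizes both terms. Squared differences and
    equal weighting of the two terms are assumptions of this sketch.
    """
    local = np.mean((x[1:] - x[:-1]) ** 2)   # adjacent-frame changes
    global_var = np.mean(np.var(x, axis=0))  # variance over the whole clip
    return -(local + global_var)


if __name__ == "__main__":
    feats = np.arange(12, dtype=float).reshape(3, 4)  # T=3 frames, C=4 channels
    print(boundary_replicated_shift(feats))
    print(variation_loss(feats))
```

In this sketch a clip whose frames are all identical yields a loss of zero, while any temporal change makes the loss negative, which matches the stated goal of rewarding action-related changes over static content.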
Huang et al. (Fri,) studied this question.