What question did this study set out to answer?

The aim is to improve video action recognition using a lightweight adaptation method without degrading spatial representations.

April 19, 2026 | Open Access

VETA-CLIP: Lightweight Video Adaptation with Efficient Spatio-Temporal Attention and Variation Loss

Key Points

The aim is to improve video action recognition using a lightweight adaptation method without degrading spatial representations.
Proposed the VETA-CLIP framework, which inserts lightweight adapters into a frozen backbone.
Implemented an Efficient Spatio-Temporal Attention (ESTA) mechanism that decouples spatial and temporal attention heads.
Introduced a Variation Loss to emphasize action-related changes across video frames rather than static backgrounds.
Achieved competitive performance on HMDB-51, UCF-101, and Something-Something v2 datasets.
Maintained competitiveness on the Kinetics-400 dataset with only 3.55M trainable parameters.
Demonstrated computational efficiency: the eight-frame variant uses 4.7 GB of peak GPU memory and 2.47 ms of inference time per video.

Abstract

Full fine-tuning of large-scale vision-language models for video action recognition incurs prohibitive computational cost and often degrades pre-trained spatial representations. To address this, we propose VETA-CLIP, a Video Efficient Temporal Adaptation framework that enhances temporal modeling while preserving cross-modal alignment. By incorporating lightweight adapters into a frozen backbone, VETA-CLIP introduces only 3.55M trainable parameters (a 98% reduction compared to full fine-tuning). Our approach features two key innovations: (1) an Efficient Spatio-Temporal Attention (ESTA) mechanism with a parameter-free boundary replication temporal shift (BRTS) module, which explicitly decouples spatial and temporal attention heads to capture inter-frame dynamics while minimizing disruption to the pre-trained spatial representations; and (2) a novel Variation Loss that maximizes both local inter-frame differences and global temporal variance, encouraging the model to focus on action-related changes rather than static backgrounds. Extensive experiments on HMDB-51, UCF-101, and Something-Something v2 demonstrate that VETA-CLIP achieves competitive performance across zero-shot, base-to-novel, and few-shot protocols, and remains competitive on the Kinetics-400 dataset. Notably, our eight-frame variant requires only 4.7 GB of peak GPU memory and 2.47 ms of inference time per video, demonstrating exceptional computational efficiency alongside consistent accuracy gains.
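The abstract names the two components without giving their formulas, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes BRTS behaves like a channel-wise temporal shift in which the first and last frames reuse their own features instead of zero padding, and that the Variation Loss is the negative sum of a mean squared inter-frame difference and a mean per-channel temporal variance. The function names, the `fold_div` and `global_weight` parameters, and the use of plain Python lists in place of feature tensors are all hypothetical.

```python
from statistics import pvariance


def boundary_replication_shift(frames, fold_div=4):
    """Parameter-free temporal shift over per-frame feature vectors.

    ASSUMPTION: one slice of channels is taken from the previous frame and
    another slice from the next frame; at the clip boundaries the index is
    clamped, so the edge frame's own features are replicated.
    """
    num_frames = len(frames)
    fold = len(frames[0]) // fold_div
    shifted = [list(f) for f in frames]  # copy; remaining channels stay unshifted
    for t in range(num_frames):
        prev_t = max(t - 1, 0)               # clamp: frame 0 replicates itself
        next_t = min(t + 1, num_frames - 1)  # clamp: last frame replicates itself
        shifted[t][:fold] = frames[prev_t][:fold]
        shifted[t][fold:2 * fold] = frames[next_t][fold:2 * fold]
    return shifted


def variation_loss(frames, global_weight=1.0):
    """Negative temporal variation, so that minimizing it maximizes variation.

    ASSUMPTION: local term = mean squared difference between adjacent frames;
    global term = per-channel variance over time, averaged across channels.
    """
    num_frames = len(frames)
    if num_frames < 2:
        return 0.0
    local = sum(
        sum((b - a) ** 2 for a, b in zip(f0, f1))
        for f0, f1 in zip(frames, frames[1:])
    ) / (num_frames - 1)
    channels = list(zip(*frames))
    global_var = sum(pvariance(ch) for ch in channels) / len(channels)
    return -(local + global_weight * global_var)


if __name__ == "__main__":
    clip = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [8.0, 9.0, 10.0, 11.0]]
    print(boundary_replication_shift(clip))
    print(variation_loss(clip))
```

Under these assumptions, a clip whose frames are all identical yields a loss of zero, while any frame-to-frame change makes the loss negative, which matches the stated goal of rewarding action-related change over static background.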
