Action recognition in videos is an important computer vision task with applications in sports, healthcare, and human-computer interaction. Existing methods often struggle to balance global motion understanding against local detail extraction, especially when actions change rapidly or several actions are combined. This paper proposes a new action recognition model, ViT-ConvGAN, which integrates a Video Transformer (ViT), a 3D Convolutional Neural Network (3D CNN), and a Conditional Generative Adversarial Network (CGAN). Input video frames are first processed by the ViT, which captures long-term temporal dependencies through spatiotemporal attention and produces global spatiotemporal features. These global features are then fed into the 3D CNN, which refines local motion details and fuses them with the global features to form a comprehensive feature map. Finally, the fused feature map is passed to the CGAN, where the generator optimizes the feature representation to yield more discriminative action characteristics and the discriminator sharpens the distinction between action categories, improving prediction accuracy and the reliability of the classification results. The model captures both global and local motion patterns and is particularly effective on complex action sequences. Experiments on the UCF101 and Kinetics-400 datasets show that ViT-ConvGAN achieves 87.3% and 95.2% Top-1 accuracy, respectively, performing particularly strongly on Kinetics-400 and surpassing several state-of-the-art models. Ablation studies confirm the contribution of each module, in particular the critical role of the ViT and the 3D CNN in feature extraction. ViT-ConvGAN provides an efficient solution that improves recognition of complex actions and offers new insights for model architecture design in action analysis.
Miao et al. (Fri,) studied this question.
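The three-stage data flow described in the abstract (ViT, then 3D CNN with fusion, then CGAN) can be made concrete by tracing tensor shapes through it. The following is a minimal sketch under stated assumptions, not the authors' implementation: the clip size, tubelet size, embedding width, convolution settings, channel-wise concatenation as the fusion step, average pooling before the CGAN, and a discriminator that emits one logit per class are all hypothetical choices that the abstract does not specify. Only the ordering of the stages and the 101 classes of UCF101 come from known facts.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class PipelineConfig:
    # Every default here is an illustrative assumption, not a value from the paper.
    frames: int = 16
    height: int = 224
    width: int = 224
    tubelet: Tuple[int, int, int] = (2, 16, 16)  # (time, height, width) patch size
    embed_dim: int = 768       # hypothetical transformer token width
    conv_channels: int = 256   # hypothetical 3D CNN output channels
    num_classes: int = 101     # UCF101 has 101 action classes


def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def transformer_tokens(cfg: PipelineConfig) -> Shape:
    """Global stage: one token per tubelet, laid out as a (T', H', W', D) grid."""
    dims = (cfg.frames, cfg.height, cfg.width)
    if any(d % p for d, p in zip(dims, cfg.tubelet)):
        raise ValueError("clip size must be divisible by the tubelet size")
    grid = tuple(d // p for d, p in zip(dims, cfg.tubelet))
    return grid + (cfg.embed_dim,)


def conv3d_features(tokens: Shape, cfg: PipelineConfig) -> Shape:
    """Local stage: a 3x3x3 convolution over the token grid (assumed settings)."""
    grid = tuple(conv_out(s) for s in tokens[:-1])
    return grid + (cfg.conv_channels,)


def fuse(global_feats: Shape, local_feats: Shape) -> Shape:
    """Fusion by channel-wise concatenation (an assumption; other fusions exist)."""
    if global_feats[:-1] != local_feats[:-1]:
        raise ValueError("global and local grids must match to concatenate")
    return global_feats[:-1] + (global_feats[-1] + local_feats[-1],)


def trace_shapes(cfg: PipelineConfig) -> Dict[str, Shape]:
    """Return the shape produced by each stage for a single input clip."""
    tokens = transformer_tokens(cfg)
    local = conv3d_features(tokens, cfg)
    fused = fuse(tokens, local)
    pooled = (fused[-1],)        # assumed global average pooling over the grid
    refined = pooled             # assumed: generator preserves feature width
    logits = (cfg.num_classes,)  # assumed: discriminator scores each class
    return {
        "vit_tokens": tokens,
        "cnn_local": local,
        "fused": fused,
        "generator_out": refined,
        "logits": logits,
    }


if __name__ == "__main__":
    for stage, shape in trace_shapes(PipelineConfig()).items():
        print(f"{stage:14s} {shape}")
```

Setting `num_classes=400` traces the same pipeline for Kinetics-400. The sketch only checks that the stages compose; it says nothing about the attention, convolution, or adversarial training details, which the abstract leaves open.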