What question did this study set out to answer?

To propose ViT-ConvGAN, a hybrid model that integrates Video Transformer, 3D CNN, and CGAN for effective action recognition in videos.

June 14, 2026 · Open Access

ViT-ConvGAN: a hybrid model for spatiotemporal action recognition using video transformer and 3D CNN

Key Points

Integrates Video Transformer for long-term temporal feature extraction.
Employs 3D CNN to refine local motion details and fuse them with global features.
Utilizes Conditional Generative Adversarial Network to optimize feature representation and improve prediction accuracy.
Achieves 87.3% Top-1 accuracy on UCF101 and 95.2% on Kinetics-400.
Outperforms several state-of-the-art models on Kinetics-400.
Ablation studies highlight the essential roles of ViT and 3D CNN in improving performance.
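
For readers unfamiliar with the metric, Top-1 accuracy is the fraction of clips whose highest-scoring predicted class matches the ground-truth label. The sketch below illustrates the definition only; the function name `top1_accuracy` and the toy scores and labels are assumptions made for this example and are not taken from the paper or its evaluation code.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: list of per-class score lists, one list per video clip (toy data here).
    labels: list of integer class indices, one per clip.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        # argmax over the class scores of this clip
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


# Hypothetical scores for 4 clips over 3 action classes.
scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.6, 0.3, 0.1],  # predicted class 0
    [0.2, 0.2, 0.6],  # predicted class 2
    [0.5, 0.4, 0.1],  # predicted class 0, true label is 1 (a miss)
]
labels = [1, 0, 2, 1]
acc = top1_accuracy(scores, labels)  # 3 of the 4 clips are classified correctly
```

Under this definition, a reported figure such as 87.3% means that the top-ranked class was correct for 87.3% of the test clips.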

Abstract

Action recognition in videos is an important computer vision task, widely used in sports, healthcare, and human-computer interaction. Existing methods often struggle to balance global motion understanding against local detail extraction, especially when dealing with rapid transitions or combinations of multiple actions. This paper proposes a new action recognition model, ViT-ConvGAN, which integrates a Video Transformer (ViT), a 3D Convolutional Neural Network (3D CNN), and a Conditional Generative Adversarial Network (CGAN). Input video frames are first processed by the ViT, which captures long-term temporal dependencies via spatiotemporal attention to generate global spatiotemporal features. These global features are then fed into the 3D CNN, which refines local motion details and fuses them with the global features to form a comprehensive feature map. Finally, the fused feature map is passed to the CGAN, where the generator optimizes the feature representation to yield more discriminative action characteristics and the discriminator sharpens the distinction between action categories to improve prediction accuracy. In short, the ViT models long-term temporal dependencies, the 3D CNN extracts local motion features, and the CGAN optimizes action predictions to make classification results more reliable. The model captures both global and local motion patterns well, especially in complex action sequences. Experiments show that ViT-ConvGAN achieves 87.3% Top-1 accuracy on UCF101 and 95.2% on Kinetics-400, where it surpasses several state-of-the-art models. Ablation studies confirm the contribution of each module, particularly the critical role of the ViT and the 3D CNN in feature extraction. ViT-ConvGAN provides an efficient solution that improves recognition of complex actions and offers new insights for model architecture design in action analysis.
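
The order of the feature-extraction stages described in the abstract (global attention first, then local refinement, then fusion) can be illustrated with a toy data-flow sketch. Everything in it is an assumption made for illustration: single-head attention with identity projections stands in for the Video Transformer, a fixed three-tap temporal filter stands in for the 3D CNN, fusion is plain concatenation, and the CGAN stage is omitted. It is not the authors' architecture or code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Toy stand-in for the global stage: every time step attends to all others.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def temporal_conv(features, kernel=(0.25, 0.5, 0.25)):
    """1D convolution along time with edge padding.

    Toy stand-in for the local stage: each output mixes only neighbouring steps.
    """
    t = len(features)
    half = len(kernel) // 2
    out = []
    for i in range(t):
        row = [0.0] * len(features[0])
        for k, w in enumerate(kernel):
            src = min(max(i + k - half, 0), t - 1)  # clamp index at the edges
            for j, x in enumerate(features[src]):
                row[j] += w * x
        out.append(row)
    return out


def fuse(global_feats, local_feats):
    """Concatenate global and local features at each time step."""
    return [g + l for g, l in zip(global_feats, local_feats)]


# Hypothetical per-frame feature vectors: 4 time steps, 2 channels each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
global_feats = self_attention(frames)      # global context per time step
local_feats = temporal_conv(global_feats)  # local refinement of the global features
fused = fuse(global_feats, local_feats)    # 4 time steps, 4 channels each
```

In this sketch the local stage consumes the output of the global stage, mirroring the description above in which the 3D CNN receives the ViT features rather than the raw frames.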


Cite This Study

Miao et al. studied this question.

synapsesocial.com/papers/6a2e4753b1cc60ccdea8be05 https://doi.org/10.1038/s41598-026-56006-6
