March 25, 2026 | Open Access

Exploring vision transformers for deep feature extraction and classification in video genre recognition for digital media

Key Points

Aimed to develop a TV genre classification framework using deep learning and vision transformers.
Analyzed a static image dataset with the Pyramid Vision Transformer (PvT).
Used a multimodal audio–video dataset with MAiVAR-T to capture temporal dependencies.
Integrated acoustic features, including mel-spectrogram, chroma, waveform, and energy patterns.
Achieved 97% accuracy with PvT on static images.
Achieved 98% accuracy with MAiVAR-T on audio–video data.
Outperformed baseline deep learning models in genre classification.

Abstract

Automation and artificial intelligence are now widely used to generate video content for television production, and computer vision plays a significant role in classifying and analyzing large volumes of multimedia content. This study aims to develop an intelligent framework for TV genre classification using deep learning and advanced transformer-based models. Traditional machine learning relies on conventional features and cannot capture the complex spatio-temporal and acoustic relationships in modern media. To address these limitations, the study explores state-of-the-art vision transformers on two standard datasets in the domain. First, a static image dataset is analyzed using the Pyramid Vision Transformer (PvT), which captures multi-scale spatial and contextual information across diverse TV scenes. Second, a multimodal audio–video dataset is analyzed using the Multimodal Attention and Invariant Vision–Audio Representation Transformer (MAiVAR-T), which captures temporal dependencies and integrates acoustic features, including mel-spectrogram, chroma, waveform, and energy patterns. Empirical analysis shows that the proposed PvT and MAiVAR-T models achieve the highest accuracies of 97% and 98%, respectively, outperforming the baseline deep learning models. These results highlight the role of multimodal transformers in improving automated genre classification in television and digital media production.
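
The acoustic features named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the 16 kHz sample rate, the 400-sample frame, the 160-sample hop, the synthetic 440 Hz tone, and the function names `hz_to_mel` and `frame_energy` are all assumptions chosen for illustration. The sketch computes short-time energy from a raw waveform and shows the Hz-to-mel mapping (HTK-style formula) that underlies a mel-spectrogram's frequency axis.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def frame_energy(waveform, frame_len=400, hop=160):
    """Short-time energy: sum of squared samples in each full frame.

    frame_len and hop are illustrative defaults (25 ms and 10 ms at 16 kHz),
    not values taken from the paper.
    """
    energies = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len]
        energies.append(sum(s * s for s in frame))
    return energies

# Hypothetical input: one second of a 440 Hz sine tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

energies = frame_energy(tone)   # one energy value per frame
mel_1k = hz_to_mel(1000.0)      # roughly 1000 mels by construction of the scale
```

In a full system, per-frame features like these would be stacked over time and passed to the audio branch of a multimodal model; a real mel-spectrogram would additionally apply a short-time Fourier transform and a mel filterbank, which this sketch omits.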

Cite This Study

Alarfaj et al. conducted this study.

synapsesocial.com/papers/69c37aa8b34aaaeb1a67c8ad https://doi.org/10.1038/s41598-026-45087-y
