Automation and artificial intelligence are now widely used to generate video content for television production, and computer vision plays a significant role in classifying and analyzing large volumes of multimedia content. This study develops an intelligent framework for TV genre classification based on deep learning and transformer-based models. Traditional machine learning relies on handcrafted features and cannot capture the complex spatio-temporal and acoustic relationships in modern media. To address this limitation, the study evaluates state-of-the-art vision transformers on two standard datasets from the domain. First, a static image dataset is analyzed using the Pyramid Vision Transformer (PvT), which captures multi-scale spatial and contextual information across diverse TV scenes. Second, a multimodal audio–video dataset is analyzed using the Multimodal Attention and Invariant Vision–Audio Representation Transformer (MAiVAR-T), which captures temporal dependencies and integrates acoustic features, including mel-spectrogram, chroma, waveform, and energy patterns. Empirical analysis shows that the PvT and MAiVAR-T models achieve accuracies of 97% and 98%, respectively, outperforming the baseline deep learning models. The results highlight the role of multimodal transformers in improving automated genre classification for television and digital media production.
Alarfaj et al. (Sun,) studied this question.