Modern football demands rapid and accurate tactical decision-making. However, traditional approaches to assessing tactical decision-making have poor ecological validity and do not scale easily. To address these challenges, this article introduces a Transformer-Based Multimodal Fusion model that combines player positioning, video, audio, and contextual metadata to classify real-time tactical decisions. Experiments involved 40 male players and a dataset of 500 multimodal play sequences. The 40-player data come from controlled laboratory-style decision-making experiments used for initial validation and reliability assessment. The 500 multimodal sequences were extracted from extended match simulations and real-game recordings, and they form the larger dataset used to train and test the multimodal transformer model. The model processes inputs in five stages: data acquisition, preprocessing, feature extraction, transformer-based fusion, and decision classification. Compared with the CNN-LSTM, BiLSTM-Attention, and GNN baselines, the proposed approach improves decision-prediction accuracy by 28% and reduces pressure-induced misclassification by 41%, with a low inference latency of 52.6 ms that makes it suitable for near-real-time applications. However, the generalizability of the findings to more diverse tactical contexts and wider athlete demographics is limited by the relatively small and homogeneous sample of young male players from a single region. These results highlight the contribution of transformer-based multimodal fusion to automated tactical decision analysis and point to the need for further validation in more diverse, large-scale match situations.
Yang et al. studied this question.
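To make the fusion and classification stages of the five-stage pipeline more concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes each of the four modalities has already been reduced to a fixed-length feature vector, applies a single head of scaled dot-product self-attention across the four modality tokens, mean-pools the result, and maps it to class probabilities. The dimensions, the number of decision classes, the random weights, and all function names (`fuse_and_classify`, `random_matrix`, and so on) are hypothetical choices made only for illustration.

```python
import math
import random

# The four input modalities named in the abstract.
MODALITIES = ("position", "video", "audio", "context")


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; hypothetical stand-ins for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def fuse_and_classify(features, params):
    """Single-head self-attention over modality tokens, then a linear classifier."""
    tokens = [features[name] for name in MODALITIES]
    dim = len(tokens[0])
    queries = [matvec(params["w_q"], t) for t in tokens]
    keys = [matvec(params["w_k"], t) for t in tokens]
    values = [matvec(params["w_v"], t) for t in tokens]

    fused_tokens = []
    for q in queries:
        # Attention weights of this token over all four modality tokens.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors gives the attended token.
        fused_tokens.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )

    # Mean-pool the attended tokens into one fused representation.
    pooled = [sum(tok[i] for tok in fused_tokens) / len(fused_tokens) for i in range(dim)]
    # Linear layer plus softmax yields one probability per decision class.
    return softmax(matvec(params["w_out"], pooled))


rng = random.Random(0)
DIM, NUM_CLASSES = 8, 3  # hypothetical sizes, not taken from the article
params = {
    "w_q": random_matrix(DIM, DIM, rng),
    "w_k": random_matrix(DIM, DIM, rng),
    "w_v": random_matrix(DIM, DIM, rng),
    "w_out": random_matrix(NUM_CLASSES, DIM, rng),
}
# Random vectors stand in for the output of the feature-extraction stage.
features = {name: [rng.gauss(0, 1) for _ in range(DIM)] for name in MODALITIES}
probs = fuse_and_classify(features, params)
print(probs)
```

A trained model would learn these weights and would typically stack several multi-head layers over time-ordered sequences; the sketch only shows how attention lets each modality weight the others before a single decision is produced.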