With the widespread adoption of 360-degree video in virtual reality environments, users exhibit complex and dynamic gaze behaviors. Existing viewport prediction methods suffer from insufficient spatio-temporal feature extraction, weak modeling of global dependencies, and poor prediction stability during long-term forecasting. To address these challenges, this paper proposes a long-term viewport prediction model based on a convolutional multi-head attention mechanism. Trained offline using historical viewing data from other users, this model incorporates convolutional multi-head attention to simultaneously model contextual dependencies, global features, and temporal dynamics, thereby enhancing long-term prediction accuracy and stability. Additionally, a dilated SE convolutional module is designed to expand the receptive field and adaptively recalibrate channels, strengthening the model’s multi-scale feature representation and local detail capture capabilities. Experimental results demonstrate that the proposed model maintains an average prediction accuracy of 96% across different prediction windows, exhibiting excellent stability and reliability.
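To make the dilated SE convolutional module concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that "SE" refers to squeeze-and-excitation channel gating and that the input is a one-dimensional feature sequence of shape (channels, time). The function names, the kernel size, and the bottleneck weights `w1` and `w2` are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Dilated 1D convolution with zero 'same' padding (assumes odd kernel size).

    x: (C_in, T) input sequence; w: (C_out, C_in, K) kernel weights.
    """
    c_out, _, k = w.shape
    t = x.shape[1]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, t))
    for j in range(k):
        # Tap j reads the padded input shifted by j * dilation steps,
        # so the K taps together span dilation * (K - 1) + 1 time steps.
        start = j * dilation
        out += w[:, :, j] @ xp[:, start:start + t]
    return out

def squeeze_excite(x, w1, w2):
    """Channel recalibration. x: (C, T); w1: (C_mid, C); w2: (C, C_mid)."""
    s = x.mean(axis=1)                        # squeeze: average over time
    h = np.maximum(w1 @ s, 0.0)               # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # sigmoid gate per channel
    return x * gates[:, None], gates

def dilated_se_block(x, w, w1, w2, dilation):
    """Hypothetical block: dilated convolution, ReLU, then SE gating."""
    y = np.maximum(dilated_conv1d(x, w, dilation), 0.0)
    return squeeze_excite(y, w1, w2)
```

In this sketch, raising `dilation` widens the receptive field without adding weights, while the gates rescale each channel by a value between 0 and 1. That is one plausible reading of "expand the receptive field and adaptively recalibrate channels".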
Zhang et al. studied this question.