What question did this study set out to answer?

The aim is to develop a model that accurately predicts viewport trajectories in 360° video streaming by addressing limitations of existing methods.

March 25, 2026 | Open Access

DiffVP: A Diffusion Model with Explicit Coordinate-Temporal Encoding for Viewport Prediction in 360° Videos

Key Points

The aim is to develop a model that accurately predicts viewport trajectories in 360° video streaming by addressing limitations of existing methods.
Proposed DiffVP, a diffusion model for viewport prediction in 360° videos.
Used Denoising Diffusion Implicit Models (DDIMs) to model future viewing trajectories as probability distributions.
Implemented Explicit Coordinate-Time Encoding (ECTE) to capture temporal dependencies and spatial relationships.
Developed Coordinate-Aware Saliency Features Fusion (CASF) module for integrating saliency and trajectory features.
DiffVP achieved the best accuracy for 2–5 s viewport prediction on three public datasets.
Short-term (<1 s) prediction performance was not sacrificed.
Demonstrated effective handling of randomness in user viewing behavior.

Abstract

Viewport prediction is a key component of tile-based 360° video streaming. Existing viewport prediction models based on Long Short-Term Memory (LSTM) networks or Transformers typically output a single future trajectory through a deterministic mapping, which fails to capture the inherent randomness of viewing behavior. Moreover, when encoding trajectory features, such models often map trajectory coordinates directly into a high-dimensional space while neglecting the spatial information inherent in the coordinates themselves. They are also limited in capturing cross-modal relationships between visual and trajectory features. To address these issues, this paper proposes DiffVP, a diffusion model for viewport prediction in 360° videos. Conditioned on historical viewing trajectories and video saliency maps, DiffVP uses Denoising Diffusion Implicit Models (DDIMs) to model future viewing trajectories as probability distributions, generating diverse and plausible predictions. In the denoising network, DiffVP employs Explicit Coordinate-Time Encoding (ECTE) to model the temporal dependencies of trajectories and the spatial relationships among coordinates, and introduces a Coordinate-Aware Saliency Features Fusion (CASF) module for cross-modal alignment and interactive fusion of saliency and trajectory features. Experimental results on three public datasets show that DiffVP achieves the best accuracy for 2–5 s viewport prediction without sacrificing short-term (<1 s) prediction performance.
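
To make the DDIM sampling idea in the abstract concrete, the following is a minimal sketch of the standard deterministic DDIM update (eta = 0) applied to a flat list of trajectory values. It is not the DiffVP implementation: the linear noise schedule, the function names, and the toy `make_oracle` noise predictor are assumptions for illustration. In DiffVP, the noise predictor would be the denoising network conditioned on the viewing history and saliency maps, which is not reproduced here.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0), applied element-wise."""
    out = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the predicted noise.
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
        # Move to the previous (less noisy) step along the same noise direction.
        out.append(math.sqrt(alpha_bar_prev) * x0_pred
                   + math.sqrt(1.0 - alpha_bar_prev) * eps)
    return out


def sample_trajectory(predict_noise, length, alpha_bars, timesteps, rng):
    """Run DDIM over a descending subsequence of timesteps, starting from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for i, t in enumerate(timesteps):
        # After the last listed timestep, step to the noise-free state.
        if i + 1 < len(timesteps):
            alpha_bar_prev = alpha_bars[timesteps[i + 1]]
        else:
            alpha_bar_prev = 1.0
        eps = predict_noise(x, t)
        x = ddim_step(x, eps, alpha_bars[t], alpha_bar_prev)
    return x


def make_oracle(target, alpha_bars):
    """Toy stand-in for a trained denoiser: returns the exact noise for a known target."""
    def predict_noise(x_t, t):
        a = alpha_bars[t]
        return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a)
                for x, x0 in zip(x_t, target)]
    return predict_noise


if __name__ == "__main__":
    alpha_bars = make_alpha_bars(1000)
    target = [0.10, 0.15, 0.22, 0.30]  # hypothetical normalized yaw values
    steps = [999, 799, 599, 399, 199, 0]
    traj = sample_trajectory(make_oracle(target, alpha_bars), len(target),
                             alpha_bars, steps, random.Random(0))
    print(traj)
```

With the toy oracle, every starting noise collapses to the same target, so the sketch only shows the mechanics of the update. With a learned, conditioned noise predictor, different initial noise draws would generally yield different trajectories, which is how a diffusion model can represent a distribution over future viewports rather than a single path.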
