What question did this study set out to answer?

This work aims to enhance video frame prediction through improved modeling of spatial structures and temporal dependencies.

May 6, 2026 | Open Access

A Video Frame Prediction Method Based on Latent-Space Autoregressive Modeling

Key Points

This work aims to enhance video frame prediction through improved modeling of spatial structures and temporal dependencies.
Developed a two-stage framework combining a HybridResSwin Autoencoder (HRS-AE) and an Enhanced FAR Transformer (EFAR).
In the first stage, compact latent representations are learned from input video frames.
The second stage involves autoregressive temporal predictions in the latent space.
Achieved a PSNR of 30.27 dB on KTH and a PSNR of 20.95 dB on BAIR.
Achieved an SSIM of 0.961 and an MSE of 22.9 on Moving MNIST.
Ablation studies indicate that the proposed components contribute to latent representation learning and long-horizon prediction stability.
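
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MSE and PSNR relate. It assumes 8-bit pixel values (peak of 255) and a per-pixel mean; the evaluation protocol actually used in the study is not described here and may differ (for example, in how errors are aggregated per frame). The function names are illustrative only.

```python
import math

def mse(pred, target):
    """Mean squared error over two equally sized flat sequences of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximum pixel value."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / err)
```

Because PSNR is a logarithmic function of MSE, higher PSNR means lower pixel-wise error for a fixed peak value.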

Abstract

Video prediction is a fundamental task in computer vision with broad applications in intelligent robotics, autonomous driving, and related fields. However, existing methods often struggle to simultaneously model long-term temporal dependencies, preserve local details, and alleviate error accumulation during autoregressive prediction. To address these issues, this paper proposes a two-stage video prediction framework composed of a HybridResSwin Autoencoder (HRS-AE) and an Enhanced FAR Transformer (EFAR). In the first stage, HRS-AE learns compact and discriminative latent representations from input video frames while preserving essential spatial structures and fine-grained details. In the second stage, EFAR performs autoregressive temporal prediction in the latent space, and the predicted latent representations are then decoded to reconstruct future video frames. Experiments on the KTH, BAIR, and Moving MNIST datasets show that the proposed method achieves competitive performance under the adopted evaluation protocol. Specifically, the framework achieves a PSNR of 30.27 dB and an LPIPS of 0.0722 on KTH, a PSNR of 20.95 dB on BAIR, and an SSIM of 0.961 with an MSE of 22.9 on Moving MNIST. Ablation studies further indicate that the proposed components contribute to latent representation learning and long-horizon prediction stability. These results suggest that the proposed framework provides a promising approach for video prediction with favorable reconstruction quality, perceptual consistency, and temporal coherence.
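
The two-stage flow described in the abstract (encode frames, predict autoregressively in latent space, decode) can be illustrated with a small sketch. This is not the authors' implementation: `rollout`, `toy_encode`, `toy_predict`, and `toy_decode` are hypothetical stand-ins, with a simple averaging encoder and a linear extrapolator taking the place of HRS-AE and EFAR. It only shows the control flow in which each predicted latent is fed back as context for the next step.

```python
from typing import Callable, List, Sequence

Latent = List[float]

def rollout(
    context_frames: Sequence[Sequence[float]],
    encode: Callable[[Sequence[float]], Latent],
    predict_next: Callable[[Sequence[Latent]], Latent],
    decode: Callable[[Latent], List[float]],
    num_future: int,
) -> List[List[float]]:
    """Predict `num_future` frames by autoregressing in latent space."""
    # Stage 1 (stand-in for the autoencoder): map context frames to latents.
    latents = [encode(frame) for frame in context_frames]
    future_frames = []
    for _ in range(num_future):
        # Stage 2 (stand-in for the temporal model): predict the next latent.
        z_next = predict_next(latents)
        # Feed the prediction back in; this is where errors can accumulate.
        latents.append(z_next)
        # Decode the predicted latent back to pixel space.
        future_frames.append(decode(z_next))
    return future_frames

# Toy stand-ins (assumptions for illustration only).
def toy_encode(frame):
    return [sum(frame) / len(frame)]  # one-dimensional "latent": mean intensity

def toy_predict(latents):
    if len(latents) < 2:
        return list(latents[-1])
    # Linear extrapolation from the last two latents.
    return [2 * a - b for a, b in zip(latents[-1], latents[-2])]

def toy_decode(z):
    return [z[0]] * 4  # constant 4-pixel frame

frames = [[0.0] * 4, [1.0] * 4, [2.0] * 4]
predicted = rollout(frames, toy_encode, toy_predict, toy_decode, num_future=2)
```

Since every step consumes earlier predictions rather than ground-truth frames, any error in one predicted latent propagates to later ones, which is the error-accumulation problem the abstract refers to.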
