Video prediction is a fundamental task in computer vision with broad applications in intelligent robotics, autonomous driving, and related fields. However, existing methods often struggle to simultaneously model long-term temporal dependencies, preserve local details, and alleviate error accumulation during autoregressive prediction. To address these issues, this paper proposes a two-stage video prediction framework composed of a HybridResSwin Autoencoder (HRS-AE) and an Enhanced FAR Transformer (EFAR). In the first stage, HRS-AE learns compact and discriminative latent representations from input video frames while preserving essential spatial structures and fine-grained details. In the second stage, EFAR performs autoregressive temporal prediction in the latent space, and the predicted latent representations are then decoded to reconstruct future video frames. Experiments on the KTH, BAIR, and Moving MNIST datasets show that the proposed method achieves competitive performance under the adopted evaluation protocol. Specifically, the framework achieves a PSNR of 30.27 dB and an LPIPS of 0.0722 on KTH, a PSNR of 20.95 dB on BAIR, and an SSIM of 0.961 with an MSE of 22.9 on Moving MNIST. Ablation studies further indicate that the proposed components contribute to latent representation learning and long-horizon prediction stability. These results suggest that the proposed framework is a promising approach to video prediction, with favorable reconstruction quality, perceptual consistency, and temporal coherence.
Zhang et al. (Fri,) studied this question.
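The two-stage procedure described in the abstract (encode context frames, predict latents autoregressively, decode each predicted latent) can be illustrated with a minimal sketch. The sketch does not reproduce HRS-AE or EFAR: `toy_encode`, `toy_decode`, and `toy_predict_next` are hypothetical stand-ins (average pooling, nearest-neighbor upsampling, and linear extrapolation), and `rollout` and `psnr` are illustrative names chosen here. The `psnr` helper uses the standard definition, 10 log10(MAX^2 / MSE), and assumes pixel values scaled to [0, 1].

```python
import numpy as np


def psnr(reference, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB, assuming pixels lie in [0, max_value]."""
    diff = reference.astype(np.float64) - prediction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)


def rollout(encode, predict_next, decode, context_frames, horizon):
    """Encode the context, predict latents one step at a time, decode each one."""
    latents = [encode(frame) for frame in context_frames]
    predicted_frames = []
    for _ in range(horizon):
        z_next = predict_next(latents)  # predictor sees every latent so far
        latents.append(z_next)          # feed the prediction back (autoregression)
        predicted_frames.append(decode(z_next))
    return predicted_frames


# Hypothetical stand-ins, NOT the paper's HRS-AE or EFAR modules.
def toy_encode(frame):
    """2x2 average pooling of an (H, W) frame with even H and W."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def toy_decode(latent):
    """Nearest-neighbor upsampling back to the original resolution."""
    return np.repeat(np.repeat(latent, 2, axis=0), 2, axis=1)


def toy_predict_next(latents):
    """Linear extrapolation from the last two latents."""
    return 2.0 * latents[-1] - latents[-2]


context = [np.full((4, 4), 0.1), np.full((4, 4), 0.2)]
future = rollout(toy_encode, toy_predict_next, toy_decode, context, horizon=2)
```

Because each predicted latent is appended to the history before the next step, any error in one step is carried into all later steps. This is the error accumulation that the abstract identifies as a difficulty of autoregressive prediction.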