The rapid development of artificial intelligence has significantly changed daily life and the way content is produced. In video generation, researchers are pursuing new approaches that aim to produce videos of higher quality, longer duration, and greater diversity, and numerous algorithms built on different architectural designs have been proposed. Unlike image generation, video generation must maintain consistency across both spatial and temporal dimensions while ensuring aesthetic quality and dynamic coherence, which makes it a more challenging task. In this survey, we provide a systematic review of existing video generation methods, tracing their evolution across architectural paradigms. We further categorize recent models by their control conditions (e.g., text-to-video, image-to-video, multi-modal guidance) and summarize their theoretical foundations, architectural designs, and algorithmic innovations. We also review commonly used video datasets and analyze their applicability to different tasks, and we present evaluations of representative models to offer a more comprehensive perspective. Our goal is to provide a clear and concise overview of these algorithms and to offer insights that support future breakthroughs in video generation.
ZHANG et al. (Wed,) studied this question.