This paper presents RIA-Net, a novel framework for realistic image animation that leverages semantic-aware feature learning and adversarial training to generate high-quality animations from a single static image and a driving video. Unlike traditional keypoint-based methods, which often suffer from local distortions and temporal instability, RIA-Net introduces a transformer-based architecture integrated with landmark and keypoint detection to preserve semantic details and capture long-range motion dynamics. The proposed semantic-aware transformer explicitly models global dependencies and predictive spatiotemporal relationships, enabling smooth and temporally consistent animations. Extensive experiments on the VoxCeleb, TaiChiHD, and TED-Talks datasets demonstrate that RIA-Net consistently outperforms state-of-the-art methods in animation quality, temporal coherence, and visual fidelity. This work opens new opportunities for realistic image animation in applications such as entertainment, virtual reality, and digital content creation.
Nega et al. (Mon,) studied this question.