Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
Building similarity graph...
Analyzing shared references across papers
Loading...
Fan Yang
Zhiyang Chen
Yousong Zhu
Building similarity graph...
Analyzing shared references across papers
Loading...
Yang et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68e25378d6d66a53c2474238 — DOI: https://doi.org/10.48550/arxiv.2510.00806
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: