What type of study is this?

This is an experimental study (also classified as a quantitative study).

October 5, 2025 · Open Access

From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation

Fan Yang (Centre for Artificial Intelligence and Robotics), Zhiyang Chen (Beijing Institute of Technology), Yousong Zhu (Shandong Institute of Automation)

Key Points

TrajVLM-Gen demonstrates improved motion consistency in video generation compared to existing models.
TrajVLM-Gen reports FVD scores of 545 on UCF-101 and 539 on MSR-VTT (lower is better), which the authors describe as competitive.
The approach uses a Vision Language Model to predict coarse-grained motion trajectories that stay consistent with real-world physics.
The predicted trajectories then guide video generation through attention-based mechanisms that refine fine-grained motion.

Abstract

Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
