What type of study is this?

This is a Literature Review study (also classified as: Quantitative Study).

October 8, 2025 · Open Access

Towards Human-Centered and Efficient Video Synthesis: A Survey of Multimodal Diffusion Models

Key Points

Multimodal video diffusion models show significant potential for controlled video synthesis, yet still face challenges in temporal consistency.
The analysis identifies trade-offs between computational efficiency and generation quality, with temporal block pruning reported to achieve 523× computational savings with minimal quality degradation.
Human-centric applications reveal issues with identity preservation and uncanny valley effects, complicating the motion synthesis process.
MIME-Vid introduces a novel framework to enhance temporal consistency and realism, paving the way for future advancements in video generation.

Abstract

Multimodal video diffusion models have emerged as transformative tools for controlled video synthesis, integrating text, images, audio, and pose sequences to generate semantically meaningful content. Despite significant advances, critical gaps persist in temporal consistency, multimodal alignment, and human-centric motion generation. Existing surveys have not clearly addressed the complex interplay between these components, particularly physiological constraints and identity preservation in human motion synthesis. This survey provides a comprehensive analysis through a unified architectural framework, examining spatial-temporal representations and multimodal conditioning mechanisms. We present the first systematic evaluation of human-centric motion modeling, addressing physiological plausibility and identity consistency challenges. Our analysis reveals fundamental trade-offs between computational efficiency and generation quality, demonstrating that specialized techniques like temporal block pruning achieve 523× computational savings with minimal quality degradation. Key findings indicate that current approaches struggle with seamless multimodal integration, that human-centric applications face "uncanny valley" effects when physics constraints are too rigid, and that identity preservation conflicts with motion dynamics. We introduce MIME-Vid (Multi-modal Integration with Motion Enhancement for Video Generation), a novel framework that integrates advanced Kalman filtering techniques with a multi-modal architecture for enhanced temporal consistency and motion realism. Furthermore, we propose novel evaluation paradigms and identify future research directions for advancing multimodal video generation.

Cite This Study

Albaghdadi et al. studied this question.

synapsesocial.com/papers/68e6a0f4718ef0a556b33cea https://doi.org/10.21203/rs.3.rs-7533477/v1
