The rapid evolution of diffusion models has shifted visual synthesis from text-only inputs to precisely controlled generation driven by multi-source heterogeneous sensor signals (e.g., audio, 3D, and physiological data). This paper presents a systematic review of cross-modal mapping and controllable generation under multi-source collaboration. Specifically, we propose a unified “cross-modal mapping and injection” taxonomy by abstracting how heterogeneous signals intervene in the generation process. We analyze these mechanisms in a backbone-agnostic manner, delineating the architectural transition from legacy U-Net backbones to scalable architectures such as Diffusion Transformers (DiTs) and tracing the technical evolution from single-source atomic driving to complex multi-source collaborative paradigms. Our mechanistic analysis indicates that, under multi-constraint scenarios, seamless feature fusion relies heavily on gradient conflict resolution, rigorous arbitration, and dynamic disentanglement. Furthermore, by systematizing current evaluation metrics, we identify intrinsic quality-controllability trade-offs by analyzing how the competing objectives play off against each other (e.g., via Pareto optimization), yielding a practical guide for technical selection. The study concludes that overcoming current generation limitations requires integrating Hardware-in-the-Loop (HIL) deployment, PDE-driven physical constraints, and causal inference, laying the foundation for next-generation robust, real-time generative models.
Chen et al. (Fri,) studied this question.