Translating spoken words into emotionally and contextually aligned video content remains an open challenge in generative AI. Subtle vocal patterns, such as pauses and pitch modulations, often obscure emotional cues, resulting in visuals that feel disconnected or flat. While several models excel at text-to-image generation, they struggle to interpret speech-based inputs, often misreading paralinguistic cues and contextual intent. To address these limitations, this research introduces EchoVid, an audio-to-video synthesis model designed to prioritize contextual fidelity and emotional alignment. A scalable web interface built with React.js and TypeScript connects to a Node.js backend with MongoDB Atlas for near-real-time generation and interaction (≈ 1.2× input-duration latency at 512 × 512 frames on an RTX A4000). Using PyAudio input, EchoVid guides Stable Diffusion v2.1 (via Hugging Face) with emotion-aware prompts, while CNN-enhanced diffusion transformers support the video generation process. Preliminary results show that EchoVid can generate visuals that reflect both the emotional tone (e.g., joyful imagery for upbeat speech) and the context of the speech. EchoVid is compared with MoCoGAN and Stable Video Diffusion variants on metrics such as FVD, FID-VID, and CLIPScore. This research further introduces two novel evaluation metrics, Temporal Semantic Stability (TSS) and Perceptual Flicker Index (PFI), which score semantic consistency and frame-to-frame change in the generated video, respectively. The results show that EchoVid outperforms the compared models.
Dharrao et al. (Tue,) studied this question.
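The abstract describes TSS and PFI only as scores of semantic consistency and frame-to-frame change; it does not give their formulas. The sketch below is therefore not the authors' definition. It shows one plausible way such quantities could be computed, assuming each frame is available as a flat list of pixel intensities in [0, 1] and each frame has a precomputed embedding vector (for example from an image encoder). The function names `mean_abs_frame_change` and `mean_consecutive_similarity` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mean_abs_frame_change(frames: Sequence[Vector]) -> float:
    """Hypothetical flicker proxy (NOT the paper's PFI formula):
    mean absolute per-pixel change between consecutive frames."""
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        # Average absolute intensity difference over all pixels of this frame pair.
        total += sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr)
    return total / (len(frames) - 1)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_consecutive_similarity(embeddings: Sequence[Vector]) -> float:
    """Hypothetical semantic-stability proxy (NOT the paper's TSS formula):
    mean cosine similarity between embeddings of consecutive frames."""
    if len(embeddings) < 2:
        return 1.0
    sims = [cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    steady = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]      # no change between frames
    flickering = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]  # maximal alternation
    print(mean_abs_frame_change(steady))
    print(mean_abs_frame_change(flickering))
    print(mean_consecutive_similarity([[1.0, 0.0], [1.0, 0.0]]))
```

Under these assumptions, a lower frame-change score indicates less flicker and a higher consecutive-similarity score indicates more stable semantics. The paper's actual metrics may normalize or weight these quantities differently.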