Sign language generation is attracting growing research attention as a way to bridge the communication gap for Deaf and Hard-of-Hearing communities worldwide. Building an efficient sign language video generation framework poses two main challenges: ensuring temporal coherence and handling diverse, complex data distributions, a problem made harder by the limited availability of annotated sign video data. To address these challenges, we propose SignMoDP, a sign language video generation model that incorporates a mixture of diffusions for universal sign language generation. We improve temporal coherence by integrating a State Space Model (SSM) into the video diffusion framework, which promotes smooth frame transitions and a natural flow of signing. By using video-to-patch conversion and a multilingual text encoder, the resulting Sign State Space Model efficiently handles long sequences and complex tasks. Additionally, we introduce the Mixture of Diffusion Paths module to manage diverse data distributions, using expert routing strategies for distinct data regions. We also propose a Patch Mixture of Experts (Patch MoE), optimized with low-rank strategies, to further improve robustness and adaptability in sign language video generation. Evaluations on the RWTH-2014, AUTSL, CSL-Daily, WLASL, and LSFB benchmarks demonstrate state-of-the-art performance, positioning SignMoDP as a universal solution for scalable, inclusive sign language generation. More details can be found on our project page: https://github.com/mingtiannihao/SignMoDP.
Qi et al. studied this question.
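
The abstract mentions expert routing for distinct data regions but does not spell out the routing rule. As a rough illustration only, the sketch below shows a generic top-k softmax gate of the kind commonly used in mixture-of-experts layers; it is not SignMoDP's actual Patch MoE or Mixture of Diffusion Paths module. The names `route_top_k` and `mixture_output`, the toy experts, and the gate scores are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the selected experts.
    This is a generic gating rule, assumed here for illustration.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def mixture_output(patch, experts, gate_scores, k=2):
    """Weighted sum of the selected experts' outputs for one patch vector."""
    out = [0.0] * len(patch)
    for idx, weight in route_top_k(gate_scores, k):
        expert_out = experts[idx](patch)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Hypothetical toy experts standing in for learned sub-networks.
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0: scale
    lambda x: [v + 1.0 for v in x],   # expert 1: shift
    lambda x: [-v for v in x],        # expert 2: negate
]

patch = [1.0, 2.0]                    # one flattened patch (toy size)
gate_scores = [0.0, 0.0, -10.0]       # hypothetical router logits
result = mixture_output(patch, experts, gate_scores, k=2)
print(result)
```

With these toy inputs, experts 0 and 1 are selected with equal weight and expert 2 is never evaluated, so `result` is `[2.0, 3.5]`. Evaluating only the selected experts is what keeps the per-patch cost roughly constant as more experts are added.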