What question did this study set out to answer?

This research aims to develop an efficient sign language video generation model that copes with scarce, diverse annotated data and improves the temporal coherence of the generated signing.

June 1, 2026

SignMoD: Sign Language Video Generation via Mixture of Diffusion

Key Points

Proposed SignMoDP, a model that integrates a mixture of diffusions for sign language video generation.
Integrated a State Space Model (SSM) into the video diffusion framework to improve temporal coherence in the generated videos.
Introduced a Patch Mixture of Experts (Patch MoE), optimized with low-rank strategies, to handle diverse data distributions.
Achieved state-of-the-art performance on RWTH-2014, AUTSL, CSL-Daily, WLASL, and LSFB benchmarks.
Demonstrated improved temporal coherence and flow in generated sign language videos.
Enhanced adaptability for diverse data distributions through expert routing strategies.
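
To make the expert-routing and low-rank ideas in the points above concrete, here is a minimal sketch in plain Python. It is not taken from the paper or its repository: the class names `LowRankExpert` and `PatchMoE`, the residual form `x + B(Ax)`, the softmax router, and top-k routing are all assumptions chosen for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LowRankExpert:
    """Hypothetical expert: residual update x + B(Ax) with rank << dim."""

    def __init__(self, dim, rank, rng):
        # A projects down (rank x dim), B projects back up (dim x rank).
        self.down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.up = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(dim)]

    def __call__(self, token):
        update = matvec(self.up, matvec(self.down, token))
        return [x + u for x, u in zip(token, update)]


class PatchMoE:
    """Hypothetical patch-level mixture of experts with top-k routing."""

    def __init__(self, dim, rank, num_experts, top_k=2, seed=0):
        rng = random.Random(seed)
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]
        self.experts = [LowRankExpert(dim, rank, rng) for _ in range(num_experts)]
        self.top_k = top_k

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, token))
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def __call__(self, tokens):
        """Process each patch token with its own weighted set of experts."""
        outputs = []
        for token in tokens:
            mixed = [0.0] * len(token)
            for index, weight in self.route(token):
                expert_out = self.experts[index](token)
                mixed = [m + weight * e for m, e in zip(mixed, expert_out)]
            outputs.append(mixed)
        return outputs


# Usage: five patch tokens of width 8, four experts of rank 2.
rng = random.Random(1)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
moe = PatchMoE(dim=8, rank=2, num_experts=4, top_k=2)
mixed_tokens = moe(tokens)
```

In this sketch each expert stores 2 x dim x rank weights rather than dim x dim, which is the usual motivation for low-rank experts; whether SignMoDP parameterizes its experts this way is not stated in the abstract.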

Abstract

Researchers are increasingly focusing on sign language generation as a means to bridge the communication divide for the Deaf and Hard-of-Hearing communities worldwide. Building an efficient sign language video generation framework presents two main challenges: ensuring temporal coherence and handling diverse, complex data distributions, especially given the limited availability of annotated sign video data. To address these challenges, we propose SignMoDP, a sign language video generation model that incorporates a mixture of diffusions for universal sign language generation. We improve temporal coherence by integrating the State Space Model (SSM) into the video diffusion framework, ensuring smooth frame transitions and a natural flow of sign language. By utilizing video-to-patch conversion and a multilingual text encoder, the resulting Sign State Space Model efficiently handles long sequences and complex tasks. Additionally, we introduce the Mixture of Diffusion Paths module to manage diverse data distributions, using expert routing strategies for distinct data regions. We propose a Patch Mixture of Experts (Patch MoE), optimized with low-rank strategies, to further ensure robustness and adaptability in sign language video generation. Evaluations on the RWTH-2014, AUTSL, CSL-Daily, WLASL, and LSFB benchmarks demonstrate state-of-the-art performance, positioning SignMoDP as a universal solution for scalable, inclusive sign language generation. More details can be found on our project page: https://github.com/mingtiannihao/SignMoDP.
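
As a rough illustration of why a state space model helps with frame-to-frame coherence, the sketch below runs a diagonal linear recurrence over a sequence of per-frame feature vectors. It is a generic textbook form, not the Sign State Space Model itself: the function name `ssm_scan` and the parameters `decay`, `input_gain`, and `output_gain` are assumptions for illustration.

```python
def ssm_scan(frames, decay, input_gain, output_gain):
    """Run a diagonal linear state space recurrence over a frame sequence.

    For each feature channel independently:
        h_t = decay * h_{t-1} + input_gain * x_t
        y_t = output_gain * h_t
    The state h carries information from earlier frames into later ones.
    """
    state = [0.0] * len(decay)
    outputs = []
    for frame in frames:
        state = [a * h + b * x for a, h, b, x in zip(decay, state, input_gain, frame)]
        outputs.append([c * h for c, h in zip(output_gain, state)])
    return outputs


# Usage: a single impulse in the first frame fades gradually instead of
# vanishing at once, so consecutive outputs change smoothly.
impulse = [[1.0], [0.0], [0.0]]
smoothed = ssm_scan(impulse, decay=[0.5], input_gain=[1.0], output_gain=[1.0])
# smoothed == [[1.0], [0.5], [0.25]]
```

The cost of the scan grows linearly with the number of frames, which is consistent with the abstract's claim about handling long sequences efficiently.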
