High-fidelity audio-driven talking-head generation holds significant promise for digital communication, online education, and the preservation of minority languages. However, applying existing diffusion-based frameworks to low-resource languages such as Tibetan remains challenging, primarily due to data scarcity and the difficulty of generalizing to unseen prosodic patterns. Current state-of-the-art systems typically rely on Transformer-based temporal attention mechanisms. While effective for short sequences, these architectures suffer from quadratic complexity, O(T²), which limits the context window and hinders the modeling of long-range temporal dependencies. In cross-lingual transfer, these limitations can be further exacerbated, and models may struggle to maintain structural stability, leading to artifacts such as motion stiffness, background warping, and temporal flickering. To address these limitations, we present HimaTalk (Himalayan Talking Head), a unified framework designed to optimize generation quality, computational efficiency, and cross-lingual generalization. We replace the standard temporal attention layer with a Bidirectional Mamba Motion Module. By employing a dual scanning mechanism (forward and backward) with learnable gated fusion, this module captures global temporal context with linear complexity, O(T), enabling the model to incorporate both past and future audio cues for smoother transitions. Furthermore, to ensure robustness against domain shifts, we introduce a spatiotemporal dual-constraint strategy: a facial region-weighted spatial loss to effectively decouple foreground dynamics from the background, and a latent second-order difference consistency loss to suppress high-frequency jitter. We also release the Tibetan Talking Head Dataset (TTHD), the first high-definition talking-head video dataset for Tibetan, comprising approximately 40 hours of video with diverse emotional expressions.
Extensive experiments demonstrate that HimaTalk achieves state-of-the-art zero-shot performance on TTHD and competitive results on a comparable real-world dataset. Notably, our framework reduces peak memory usage by 35% and increases inference speed by approximately 45% compared to temporal-attention baselines under comparable settings, while delivering superior temporal coherence to unidirectional state-space model (SSM) approaches.
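To make the spatiotemporal dual-constraint strategy more concrete, the following is a minimal pure-Python sketch of what the two loss terms might look like. It is an illustration under stated assumptions, not the HimaTalk implementation: the function names, the squared-error form of both terms, the binary face mask, and the `face_weight` default are all hypothetical and are not taken from the abstract.

```python
def second_order_diff(seq):
    """Return z[t+1] - 2*z[t] + z[t-1] for every interior frame t.

    `seq` is a list of frames, each frame a flat list of latent values.
    """
    return [
        [a - 2.0 * b + c for a, b, c in zip(seq[t + 1], seq[t], seq[t - 1])]
        for t in range(1, len(seq) - 1)
    ]


def second_order_consistency_loss(pred, target):
    """Mean squared gap between the second-order temporal differences
    of predicted and target latents (assumed form of the loss)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    if len(pred) < 3:
        raise ValueError("need at least three frames")
    d_pred = second_order_diff(pred)
    d_tgt = second_order_diff(target)
    sq = [
        (p - q) ** 2
        for row_p, row_q in zip(d_pred, d_tgt)
        for p, q in zip(row_p, row_q)
    ]
    return sum(sq) / len(sq)


def region_weighted_mse(pred, target, face_mask, face_weight=2.0):
    """Weighted MSE over one flattened frame.

    Positions where face_mask is truthy are weighted by `face_weight`
    (a hypothetical hyperparameter); all other positions get weight 1.
    """
    weights = [face_weight if m else 1.0 for m in face_mask]
    num = sum(w * (p - q) ** 2 for w, p, q in zip(weights, pred, target))
    return num / sum(weights)


# Toy example: a linear latent trajectory versus a jittery one.
smooth = [[0.0], [1.0], [2.0], [3.0]]
jitter = [[0.0], [1.5], [1.5], [3.0]]
temporal_loss = second_order_consistency_loss(jitter, smooth)

# The same pixel error costs more inside the face region than outside it.
face_err = region_weighted_mse([1.0, 0.0], [0.0, 0.0], [1, 0], face_weight=3.0)
bg_err = region_weighted_mse([0.0, 1.0], [0.0, 0.0], [1, 0], face_weight=3.0)
```

In this toy setting, the jittery trajectory incurs a nonzero temporal penalty relative to the linear one even though both start and end at the same values, and an error of equal magnitude is penalized more heavily when it falls inside the masked face region. A trained model would apply such terms to tensors in latent or pixel space rather than to Python lists.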
Du et al. (Wed,) studied this question.