What question did this study set out to answer?

The study aims to develop an efficient audio-driven talking-head generation system for Tibetan that copes with data scarcity and with the difficulty of modeling long-range temporal dependencies.

March 7, 2026 | Open Access

HimaTalk: robust audio-driven Tibetan talking head generation via bi-directional Mamba and spatiotemporal constraints

Key Points

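Introduces the HimaTalk framework, which replaces the standard temporal attention layer with a Bidirectional Mamba Motion Module.
Uses a dual (forward and backward) scanning mechanism with learnable gated fusion to capture global temporal context with linear complexity, O(T).
Adds a spatiotemporal dual-constraint strategy for robustness against domain shifts: a facial region-weighted spatial loss and a latent second-order difference consistency loss.
Releases the Tibetan Talking Head Dataset (TTHD), comprising approximately 40 hours of high-definition video with diverse emotional expressions.
Achieves state-of-the-art zero-shot performance on TTHD and competitive results on a comparable real-world dataset.
Reduces peak memory usage by 35% and increases inference speed by approximately 45% compared to temporal-attention baselines under comparable settings.
Shows superior temporal coherence compared to unidirectional state-space model (SSM) approaches.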

Abstract

High-fidelity audio-driven talking-head generation holds significant promise for digital communication, online education, and the preservation of minority languages. However, applying existing diffusion-based frameworks to low-resource languages like Tibetan remains challenging, primarily due to data scarcity and the difficulty of generalizing to unseen prosodic patterns. Current state-of-the-art systems typically rely on Transformer-based temporal attention mechanisms. While effective for short sequences, these architectures suffer from quadratic complexity, O(T²), which limits the context window and hinders the modeling of long-range temporal dependencies. In cross-lingual transfer, these limitations can be further exacerbated, and models may struggle to maintain structural stability, leading to artifacts such as motion stiffness, background warping, and temporal flickering. To address these limitations, we present HimaTalk (Himalayan Talking Head), a unified framework designed to optimize generation quality, computational efficiency, and cross-lingual generalization. We replace the standard temporal attention layer with a Bidirectional Mamba Motion Module. By employing a dual scanning mechanism (forward and backward) with learnable gated fusion, this module captures global temporal context with linear complexity, O(T), enabling the model to incorporate both past and future audio cues for smoother transitions. Furthermore, to ensure robustness against domain shifts, we introduce a spatiotemporal dual-constraint strategy: a facial region-weighted spatial loss to effectively decouple foreground dynamics from the background, and a latent second-order difference consistency loss to suppress high-frequency jitter. We also release the Tibetan Talking Head Dataset (TTHD), the first high-definition talking-head video dataset for Tibetan, comprising approximately 40 hours of video with diverse emotional expressions.
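The dual scanning idea can be illustrated with a toy sketch. This is not the authors' Bidirectional Mamba Motion Module: the fixed-decay recurrence below merely stands in for a selective state-space layer, and the names `linear_scan`, `bidirectional_scan`, `decay`, and `gate_logit` are hypothetical. It only shows how a forward and a backward O(T) scan, blended by a gate, can let every frame see both past and future context.

```python
import math
from typing import List


def linear_scan(x: List[float], decay: float) -> List[float]:
    """Causal recurrence h[t] = decay * h[t-1] + (1 - decay) * x[t].

    One pass over the sequence, so the cost is O(T). A real Mamba layer
    uses input-dependent parameters; the fixed decay is an assumption.
    """
    h = 0.0
    out = []
    for value in x:
        h = decay * h + (1.0 - decay) * value
        out.append(h)
    return out


def bidirectional_scan(x: List[float], decay: float = 0.5,
                       gate_logit: float = 0.0) -> List[float]:
    """Fuse a forward and a backward scan with a sigmoid gate.

    gate_logit would be a learned parameter in a trained model; here it
    is a plain scalar for illustration.
    """
    forward = linear_scan(x, decay)
    # Scan the reversed sequence, then flip back so indices align with x.
    backward = linear_scan(x[::-1], decay)[::-1]
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * f + (1.0 - gate) * b for f, b in zip(forward, backward)]
```

With an impulse in the middle of a short sequence, for example `bidirectional_scan([0.0, 0.0, 1.0, 0.0, 0.0])`, the forward scan alone leaves the first two frames at zero, whereas the fused output is non-zero on both sides of the impulse.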
Extensive experiments demonstrate that HimaTalk achieves state-of-the-art zero-shot performance on TTHD and competitive results on a comparable real-world dataset. Notably, our framework reduces peak memory usage by 35% and increases inference speed by approximately 45% compared to temporal-attention baselines under comparable settings, while delivering superior temporal coherence to unidirectional state-space model (SSM) approaches.
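The two constraints named in the abstract might look roughly like the following. The abstract does not give the exact formulas, so this is one plausible reading under stated assumptions: the temporal term matches second-order differences of predicted and target latents, and the spatial term is a mean squared error that up-weights pixels inside a face mask. All function names, the `face_weight` parameter, and the use of flat scalar sequences are hypothetical simplifications.

```python
from typing import List, Sequence


def second_differences(seq: Sequence[float]) -> List[float]:
    """Discrete second-order difference: z[t+1] - 2*z[t] + z[t-1]."""
    return [seq[t + 1] - 2.0 * seq[t] + seq[t - 1]
            for t in range(1, len(seq) - 1)]


def second_order_consistency_loss(pred: Sequence[float],
                                  target: Sequence[float]) -> float:
    """Mean squared gap between predicted and target second differences.

    Assumed form: penalizes frame-to-frame "acceleration" in the
    prediction that the target sequence does not have.
    """
    if len(pred) != len(target) or len(pred) < 3:
        raise ValueError("need two equal-length sequences of at least 3 frames")
    d_pred = second_differences(pred)
    d_tgt = second_differences(target)
    return sum((p - q) ** 2 for p, q in zip(d_pred, d_tgt)) / len(d_pred)


def region_weighted_loss(pred: Sequence[float], target: Sequence[float],
                         face_mask: Sequence[int],
                         face_weight: float = 2.0) -> float:
    """Weighted MSE: pixels with face_mask == 1 count face_weight times.

    face_weight is an illustrative value, not one taken from the paper.
    """
    if not (len(pred) == len(target) == len(face_mask)) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    weights = [face_weight if m else 1.0 for m in face_mask]
    weighted = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return weighted / sum(weights)
```

In this sketch a prediction that oscillates between frames while the target moves smoothly yields a large temporal loss, and the same pixel error costs more inside the face mask than in the background.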

Cite This Study

Du et al. studied this question.

synapsesocial.com/papers/69abc0925af8044f7a4e9498 https://doi.org/10.1007/s44443-026-00617-6
