Text-to-image generation of Thangka artwork requires high-resolution synthesis and precise semantic-spatial alignment. However, existing diffusion models incur high computational overhead and struggle with spatial reasoning at scale. This paper introduces MMED, a Mamba-enhanced multi-scale diffusion framework for efficient Thangka generation. First, the Mamba Spatial Mixer (MSM) replaces quadratic self-attention with adaptive sparse grid-scanning, achieving near-linear complexity while still capturing long-range dependencies. Second, the Dual-Stream Gated Mamba Cross-Attention (DSG-MCA) module couples textual instructions with positional encodings for fine-grained semantic-spatial precision. Third, the Adaptive Parallel Mamba Residual (APMR) block integrates convolution with state-space dynamics to improve feature propagation and training stability. Experiments on a curated Thangka dataset show that MMED reduces training time by 35% and inference latency by 60%, while achieving a 16% improvement in FID and a 31% increase in IS over strong baselines. Superior performance on the CUB dataset further supports the framework's generalization capability, offering a new perspective on efficient cultural heritage generation.
Hu et al. (Sat,) studied this question.