Scene text–image generation aims to synthesize natural images containing readable and visually coherent text. Although recent diffusion-based methods have shown promising results, they often struggle with low-resource Southeast Asian languages because of complex glyph structures, limited language resources, and weak alignment between generated text and background carriers. To address this issue, we propose FG-Text-SD, a training-free controllable scene text–image generation framework built on Stable Diffusion. The proposed framework organizes multiple text instances in an instance-level manner, injects rendered glyph structure priors into the denoising process to stabilize complex character shapes, modulates cross-attention with carrier-aware masks to improve text-to-surface alignment, and employs OCR-guided local repainting to correct residual local errors. Experiments are conducted on AnyText-benchmark, CVTG-2K, and a newly constructed evaluation set covering Thai, Lao, Khmer, and Burmese. The proposed method achieves strong performance on both public benchmarks and low-resource language subsets, improving text accuracy, readability, and spatial consistency without additional model retraining. These results demonstrate that FG-Text-SD provides an effective solution for controllable scene text–image generation in low-resource multilingual settings.
Shi et al. (Sat,) studied this question.