What question did this study set out to answer?

The aim is to develop a controllable scene text-image generation framework for low-resource Southeast Asian languages.

May 6, 2026 | Open Access

FG-Text-SD: Training-Free Controllable Scene Text–Image Generation for Low-Resource Southeast Asian Languages

Key Points

Developed FG-Text-SD using a training-free approach based on Stable Diffusion.
Organized multiple text instances at the instance level, so each instance is handled individually.
Injected glyph structure priors in the denoising process to stabilize complex character shapes.
Utilized cross-attention with carrier-aware masks for better text alignment.
Implemented OCR-guided local repainting for correcting local errors.
Achieved improved text accuracy and readability on AnyText-benchmark, CVTG-2K, and a newly constructed Thai, Lao, Khmer, and Burmese evaluation set.
Demonstrated strong performance in low-resource language subsets including Thai, Lao, Khmer, and Burmese.
Enhanced spatial consistency without needing additional model retraining.

Abstract

Scene text–image generation aims to synthesize natural images containing readable and visually coherent text. Although recent diffusion-based methods have shown promising results, they often struggle with low-resource Southeast Asian languages because of complex glyph structures, limited language resources, and weak alignment between generated text and background carriers. To address this issue, we propose FG-Text-SD, a training-free controllable scene text–image generation framework built on Stable Diffusion. The proposed framework organizes multiple text instances in an instance-level manner, injects rendered glyph structure priors into the denoising process to stabilize complex character shapes, modulates cross-attention with carrier-aware masks to improve text-to-surface alignment, and employs OCR-guided local repainting to correct residual local errors. Experiments are conducted on AnyText-benchmark, CVTG-2K, and a newly constructed evaluation set covering Thai, Lao, Khmer, and Burmese. The proposed method achieves strong performance on both public benchmarks and low-resource language subsets, improving text accuracy, readability, and spatial consistency without additional model retraining. These results demonstrate that FG-Text-SD provides an effective solution for controllable scene text–image generation in low-resource multilingual settings.
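
The abstract does not give the exact rule used to modulate cross-attention, so the following is only a minimal sketch of one plausible reading: text-token attention logits are raised inside a carrier mask and lowered outside it before the softmax. The function name `modulate_cross_attention`, the additive form, and the `bias` value are assumptions made for illustration and are not taken from FG-Text-SD.

```python
import numpy as np

def modulate_cross_attention(scores, text_token_ids, carrier_mask, bias=4.0):
    """Illustrative carrier-aware attention modulation (hypothetical, not the paper's rule).

    scores:         (num_pixels, num_tokens) pre-softmax attention logits.
    text_token_ids: indices of the prompt tokens that spell the scene text.
    carrier_mask:   (num_pixels,) boolean, True where the text carrier
                    (sign, label, wall, ...) is located.
    bias:           assumed strength of the additive modulation.
    """
    scores = np.asarray(scores, dtype=np.float64).copy()
    # +bias for pixels on the carrier, -bias for pixels off it.
    sign = np.where(carrier_mask, 1.0, -1.0)
    scores[:, text_token_ids] += bias * sign[:, None]
    # Numerically stable softmax over the token axis.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example: 4 latent pixels, 3 prompt tokens, token 1 is the text token.
attn = modulate_cross_attention(
    scores=np.zeros((4, 3)),
    text_token_ids=[1],
    carrier_mask=np.array([True, True, False, False]),
)
```

In this toy case the text token dominates attention at the two carrier pixels and is suppressed at the other two, which is the qualitative behavior the abstract attributes to carrier-aware masking. In a real Stable Diffusion pipeline such a rule would have to be applied inside the U-Net cross-attention layers at each denoising step.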
