What question did this study set out to answer?

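The study asks how to improve generation quality in MMDiT-based text-to-image models when a prompt contains multiple subjects of similar semantics or appearance, a setting in which these models still neglect or mix subjects.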

February 14, 2026

Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

Key Points

Goal: improve MMDiT-based text-to-image generation for prompts that contain multiple similar subjects.
Identified three ambiguities in the MMDiT architecture that cause subject neglect or mixing: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity.
Proposed three loss functions, each tailored to one ambiguity: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss.
Applied test-time optimization at early denoising steps to repair the ambiguous latent on the fly.
Added Overlap Online Detection and a Back-to-Start Sampling Strategy, because Semantic Ambiguity persists when the guidance from Overlap Loss is not explicit enough.
Experiments showed superior generation quality and much higher success rates than existing methods.
Improvements were consistent across multiple MMDiT-based models, including SD3, SD3.5, and FLUX.
The approach was validated on a newly constructed, challenging dataset of similar subjects.

Abstract

Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on the fly by test-time optimization at early denoising steps. In detail, we design three loss functions, Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate one of these ambiguities. Despite significant improvements, we observe that Semantic Ambiguity persists when generating multiple similar subjects, as the guidance provided by Overlap Loss is not explicit enough. Therefore, we further propose Overlap Online Detection and a Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. The consistent and substantial improvements observed across multiple MMDiT-based text-to-image models such as SD3, SD3.5, and FLUX provide strong evidence of the general applicability of our method. Project page: https://wtybest.github.io/projects/EnMMDiT/.
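
The idea of repairing the latent by test-time optimization at early denoising steps can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the latent is a small list of vectors, the two "attention maps" are sigmoids of dot products with two hypothetical subject-token vectors, the overlap term is simply the mean product of those maps, and the denoiser is a placeholder that shrinks the latent. A real pipeline would read attention maps from the MMDiT blocks, combine all three losses, and use automatic differentiation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def overlap_loss_and_grad(latent, tok_a, tok_b):
    """Toy overlap term: mean over positions of the product of two
    attention-like maps, with its analytic gradient w.r.t. the latent."""
    n = len(latent)
    loss = 0.0
    grad = []
    for z in latent:
        a = sigmoid(dot(z, tok_a))  # map for hypothetical subject A
        b = sigmoid(dot(z, tok_b))  # map for hypothetical subject B
        loss += a * b / n
        da = a * (1.0 - a) * b / n  # d(loss)/d(logit_a)
        db = b * (1.0 - b) * a / n  # d(loss)/d(logit_b)
        grad.append([da * ea + db * eb for ea, eb in zip(tok_a, tok_b)])
    return loss, grad


def refine_latent(latent, tok_a, tok_b, lr=1.0, iters=10):
    """A few gradient-descent updates on the latent itself (test-time optimization)."""
    for _ in range(iters):
        _, grad = overlap_loss_and_grad(latent, tok_a, tok_b)
        latent = [[zi - lr * gi for zi, gi in zip(z, g)]
                  for z, g in zip(latent, grad)]
    return latent


def sample(latent, tok_a, tok_b, num_steps=10, early_steps=3):
    """Sampling loop that refines the latent only during the early steps."""
    for t in range(num_steps):
        if t < early_steps:
            latent = refine_latent(latent, tok_a, tok_b)
        # Placeholder for one denoising step of a real model (assumption).
        latent = [[0.95 * zi for zi in z] for z in latent]
    return latent


rng = random.Random(0)
latent = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
tok_a = [1.0, 0.5, 0.0, -0.5]   # hypothetical subject-token vectors,
tok_b = [0.8, 0.6, 0.1, -0.4]   # deliberately similar to each other

before, _ = overlap_loss_and_grad(latent, tok_a, tok_b)
after, _ = overlap_loss_and_grad(refine_latent(latent, tok_a, tok_b), tok_a, tok_b)
print(f"overlap before: {before:.4f}, after: {after:.4f}")
```

Restricting the updates to the first few steps mirrors the abstract's description of optimizing at early denoising steps. Minimizing this toy term alone would eventually suppress both maps, which is one reason a single overlap term is not a complete objective on its own.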
