Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. While existing methods often insert objects directly, adaptive and interactive fusion, which requires contextual adaptation and foreground-background interplay, remains a challenging yet critical task. To address this, we first propose a pipeline for generating high-quality fusion data. By combining iterative in-context learning with existing tools, we curate a diverse cross-scene dataset supporting three core tasks: object integration, replacement, and attribute-referenced editing. Leveraging this, we introduce DreamFuse, a unified diffusion-based approach that jointly optimizes these capabilities. DreamFuse exploits the Diffusion Transformer (DiT) architecture, using its attention mechanism to extract and align foreground-background features for coherent fusion. For flexible control, we incorporate a Positional Affine mechanism, enabling precise spatial and scale adjustments while supporting diverse text-driven fusion. Furthermore, we employ Localized Direct Preference Optimization (L-DPO), refining the model via human feedback to enhance harmony and consistency. Extensive experimental results demonstrate DreamFuse's superiority over state-of-the-art approaches across multiple metrics.
Huang et al. (Thu,) studied this question.