What question did this study set out to answer?

The study asks how to improve metadata-guided 3D brain MRI synthesis by modeling spatial dependencies within volumetric features and by aligning visual and textual features more precisely.

May 15, 2026

Refine Then Fusion: Robust 3D Brain MRI Synthesis via Vision–Language Collaboration

Key Points

Developed RTFSyn, a 3D MRI synthesis framework combining visual refinement and cross-modal fusion.
Validated the framework on four multi-center datasets, evaluating robustness across varied imaging conditions.
Introduced an axis-aware visual refinement module to capture directional dependencies and a cross-modal adaptive fusion module for fine-grained semantic injection.
RTFSyn surpassed existing state-of-the-art methods on quantitative metrics.
Achieved high-fidelity synthesis with favorable computational efficiency and remained robust under diverse imaging artifacts.
Performed robustly in zero-shot evaluations and multi-dimensional clinical validations.

Abstract

Metadata-guided cross-modality 3D MRI synthesis aims to generate target-contrast volumes from source-modality data conditioned on clinically available metadata, which is important for enhancing clinical imaging flexibility. However, existing methods still suffer from two main limitations: 1) They neglect spatial dependencies within volumetric representations, yielding structurally ambiguous features that blur anatomical boundaries and hinder precise semantic integration. 2) They rely on conventional cross-attention between visual and textual features, which limits the precision of visual-semantic alignment and reduces robustness under challenging conditions. To address these issues, we propose RTFSyn, a metadata-guided 3D MRI synthesis framework that achieves effective vision-language collaboration through a refine-then-fusion paradigm. RTFSyn has three main components. First, we design an axis-aware visual refinement module that captures directional dependencies within volumetric features, enabling redundancy suppression and improved structural representation before fusion. Second, we propose a cross-modal adaptive fusion module that leverages pixel packing-recovery to realize efficient cross-attention for improved alignment, while text-conditioned dynamic convolution enables fine-grained semantic injection; together these enhance vision-language collaboration. Third, an implicit neural decoder reconstructs the target modality as a continuous function, enabling flexible high-fidelity synthesis. Under this paradigm, RTFSyn combines spatial refinement with adaptive feature fusion to achieve precise cross-modal alignment.
Extensive experiments across four multi-center datasets demonstrate that RTFSyn not only surpasses state-of-the-art methods quantitatively, but also exhibits robust performance under diverse imaging artifacts, zero-shot evaluations, and multi-dimensional clinical validations, all with favorable computational efficiency. The high fidelity, robustness, and efficiency of RTFSyn demonstrate its great potential for clinical applications.
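
The abstract does not say how pixel packing-recovery is implemented, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' module. It assumes packing means a 3D space-to-depth rearrangement that folds each r×r×r block into channels, which cuts the number of visual query tokens by a factor of r³ before cross-attention with the metadata text tokens; recovery is the exact inverse. All function names, the projection size, and the random weights (standing in for learned projections) are hypothetical.

```python
import numpy as np

def pack(vol, r):
    """Fold each r*r*r block of a (C, D, H, W) volume into the channel axis."""
    c, d, h, w = vol.shape
    assert d % r == 0 and h % r == 0 and w % r == 0
    x = vol.reshape(c, d // r, r, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 6, 1, 3, 5)           # (C, r, r, r, D/r, H/r, W/r)
    return x.reshape(c * r**3, d // r, h // r, w // r)

def unpack(packed, r):
    """Exact inverse of pack: restore the original spatial resolution."""
    cp, d, h, w = packed.shape
    c = cp // r**3
    x = packed.reshape(c, r, r, r, d, h, w)
    x = x.transpose(0, 4, 1, 5, 2, 6, 3)           # (C, D/r, r, H/r, r, W/r, r)
    return x.reshape(c, d * r, h * r, w * r)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(vis_tokens, txt_tokens, wq, wk, wv):
    """Visual tokens query the text tokens; the result is added residually."""
    q = vis_tokens @ wq                            # (N, dim)
    k = txt_tokens @ wk                            # (T, dim)
    v = txt_tokens @ wv                            # (T, Cp)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) # (N, T), rows sum to 1
    return vis_tokens + attn @ v

def packed_cross_attention(vol, txt_tokens, r, rng, dim=16):
    """Pack, attend over the reduced token set, then recover full resolution."""
    packed = pack(vol, r)
    cp, d, h, w = packed.shape
    tokens = packed.reshape(cp, -1).T              # (N, Cp) with N = d*h*w
    # Random matrices stand in for learned projection weights (assumption).
    wq = rng.standard_normal((cp, dim)) * 0.1
    wk = rng.standard_normal((txt_tokens.shape[1], dim)) * 0.1
    wv = rng.standard_normal((txt_tokens.shape[1], cp)) * 0.1
    fused = cross_attention(tokens, txt_tokens, wq, wk, wv)
    return unpack(fused.T.reshape(cp, d, h, w), r)

rng = np.random.default_rng(0)
vol = rng.standard_normal((2, 8, 8, 8))            # toy feature volume (C, D, H, W)
txt = rng.standard_normal((5, 12))                 # toy metadata embedding (T, E)
out = packed_cross_attention(vol, txt, r=2, rng=rng)
```

With r = 2 the toy volume goes from 512 spatial positions to 64 query tokens, so the attention matrix shrinks by the same factor of 8 while the output keeps the input shape. Whether RTFSyn packs in this way, and how it combines the result with text-conditioned dynamic convolution, is not specified in the abstract.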
