Motivation: We aim to introduce a foundation model that uses visual and textual inputs to enable robust, unified image synthesis in multimodal MRI.
Goal(s): We aim to demonstrate a versatile foundation model that uses language guidance to describe the target accurately and that adapts easily to new modalities and datasets through computationally efficient fine-tuning with minimal additional data and training.
Approach: We condition synthesis on source-modality images and target-modality text descriptions, using a text encoder to embed the textual inputs, a one-step latent diffusion model for fast synthesis, and low-rank adaptation for efficient fine-tuning.
Results: We demonstrate high-quality synthesis across various modalities and datasets.
Impact: Conventional synthesis models rely on image-to-image translation with visual inputs alone and often generalize poorly. We demonstrate a language-guided foundation model that leverages textual inputs to adapt more readily to new modalities.
Yurt et al. (Tue,) studied this question.
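The low-rank adaptation named in the Approach can be illustrated with a minimal sketch. It shows the general technique only: a frozen weight matrix W is augmented with a trainable update (alpha / r) * B A, where B and A have rank r. The class name `LoRALinear`, the helper `matmul`, the rank, the scaling factor `alpha`, and the initialization are all assumptions made for illustration; the abstract does not state which layers are adapted or which hyperparameters are used.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen W (out x in) plus (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # assumed scaling convention
        # A gets a small random init; B starts at zero, so the adapted
        # layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a vector given as a flat list."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def num_trainable(self):
        """Count only the low-rank parameters in A and B."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4 x 4 identity weight standing in for a pretrained layer.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1)
```

In this toy case, fine-tuning would update only the 8 entries of A and B rather than all 16 entries of W. The saving grows with layer size, which is the reason low-rank adaptation suits fine-tuning with limited additional data and compute.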