AI in radiology is rapidly shifting toward foundation models that learn universal representations from large-scale imaging and multimodal clinical data. In this review, we outline the key technical components of these models, including self-supervised encoders, fusion modules for feature alignment, and task-specific decoders, and summarize recent work in three categories: 1) image-only models, trained on millions of unlabeled radiology scans to enable robust transfer learning with minimal annotation; 2) chest X-ray image-report models, which leverage large corpora of chest X-rays paired with reports to learn joint visual and textual embeddings; and 3) image-report models for other modalities, which fuse volumetric images with structured reports or clinical metadata. We then discuss the relevant evaluation strategies, including vision-centric, language-centric, and benchmark-based metrics, and outline approaches for clinical validation. Finally, we highlight the persistent challenges in applying these models and propose future directions for multimodal integration and human–AI collaboration to advance personalized radiology.
Dongheon Lee studied this question.