Multimodal generative artificial intelligence (AI) has emerged as a transformative approach in medical diagnostics, integrating diverse data sources to enhance clinical decision-making and patient care. In this review, we systematically analyze recent advances and methodologies in multimodal generative AI, focusing on the fusion of medical imaging data with clinical records, genomic information, and textual narratives. We evaluate how combining these modalities mirrors the way physicians reason across multiple sources of evidence, and how this can improve diagnostic accuracy and personalized patient management in specialties including radiology, pathology, dermatology, and ophthalmology. We discuss three key integration strategies: tool-use approaches, in which large language models orchestrate specialized diagnostic modules; grafting techniques, which incorporate visual analysis directly into linguistic frameworks; and unified frameworks, which process multiple modalities simultaneously within a single cohesive model. We also highlight exemplary models, such as PathChat, that demonstrate substantial accuracy gains from multimodal integration (e.g., 89.5% accuracy in pathological image interpretation). We critically assess ongoing challenges, including technical barriers to data integration, interpretability issues that affect clinical trust, privacy and ethical concerns, and the evolving regulatory landscape for AI-driven diagnostics. Finally, we propose directions for future research, emphasizing the need for large-scale clinical validation studies, standardized evaluation frameworks, advances in explainable AI methods, and privacy-preserving techniques such as federated learning. Multimodal generative AI holds significant promise to augment rather than replace clinical expertise, serving as a complement to human decision-making in medicine.
Maleki et al. studied this question.