We propose a multimodal Retrieval-Augmented Generation (RAG) framework for generating clinically accurate radiology reports from chest X-rays. Our study systematically evaluates similarity metrics for retrieval and the impact of negative sampling within the RAG pipeline. The approach extracts predicted findings and their scores from a TorchXRayVision model, builds vector indices over MIMIC-CXR training reports, retrieves relevant neighbors using multiple strategies, and generates reports via a Large Language Model (LLM). In 16 controlled experiments, RAG consistently outperformed a non-retrieval baseline on both clinical (CheXbert and RadGraph) and linguistic (BERTScore, BLEU, METEOR, ROUGE-L) metrics. In contrast, adding explicit negative sampling at the prompt level consistently degraded performance, suggesting that dissimilar reports confuse the LLM rather than provide useful guidance. Conceptually, RAG grounds a general-purpose LLM in precise, case-specific exemplars, steering it toward the specialized phrasing and clinical judgment of an expert radiologist.
Zamaninejad et al. studied this question.
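The retrieval and prompt-assembly steps summarized in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the similarity metrics, index layout, or prompt wording, so the function names (`retrieve`, `build_prompt`), the two metrics, the finding labels, the score vectors, and the toy report texts are all hypothetical. The sketch treats each training report as a vector of finding scores, ranks reports by similarity to the query study, and optionally appends the least similar reports as prompt-level negatives.

```python
import numpy as np

def cosine_sim(query, index):
    """Cosine similarity between one query vector and each row of index."""
    q = query / (np.linalg.norm(query) + 1e-12)
    m = index / (np.linalg.norm(index, axis=1, keepdims=True) + 1e-12)
    return m @ q

def neg_euclidean(query, index):
    """Negated Euclidean distance, so that larger always means more similar."""
    return -np.linalg.norm(index - query, axis=1)

# Hypothetical choice of metrics; the study's actual set is not given in the abstract.
METRICS = {"cosine": cosine_sim, "euclidean": neg_euclidean}

def retrieve(query, index, k=2, n_neg=0, metric="cosine"):
    """Return indices of the k most similar rows and the n_neg least similar rows."""
    scores = METRICS[metric](query, index)
    order = np.argsort(-scores, kind="stable")  # most similar first
    positives = order[:k].tolist()
    negatives = order[::-1][:n_neg].tolist()    # empty list when n_neg == 0
    return positives, negatives

def build_prompt(findings, reports, positives, negatives):
    """Assemble a plain-text prompt; the wording is illustrative only."""
    scored = ", ".join(f"{name}={score:.2f}" for name, score in findings.items())
    lines = [f"Predicted findings: {scored}", "Similar reports:"]
    lines += [f"- {reports[i]}" for i in positives]
    if negatives:
        lines.append("Dissimilar reports (do not imitate):")
        lines += [f"- {reports[i]}" for i in negatives]
    lines.append("Write the report for the current study.")
    return "\n".join(lines)

# Toy data: made-up score vectors and report texts, not taken from MIMIC-CXR.
labels = ["Cardiomegaly", "Effusion", "Pneumonia"]
index = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.9, 0.0],
])
reports = [
    "Toy report A: enlarged cardiac silhouette.",
    "Toy report B: mild cardiomegaly, lungs clear.",
    "Toy report C: right lower lobe consolidation.",
    "Toy report D: small left pleural effusion.",
]
query = np.array([0.85, 0.15, 0.05])
findings = dict(zip(labels, query))

positives, negatives = retrieve(query, index, k=2, n_neg=1)
prompt_with_negatives = build_prompt(findings, reports, positives, negatives)
prompt_without_negatives = build_prompt(findings, reports, positives, [])
```

Setting `n_neg=0`, or passing an empty list to `build_prompt`, gives the retrieval-only variant. Comparing the two prompts on the same query is one simple way to set up the kind of negative-sampling ablation the abstract describes.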