No systematic review has previously examined the application of large language models (LLMs) for generating impressions from radiology report findings. This study systematically reviews the performance of LLMs on this task and their associated evaluation methodologies. A search of seven electronic databases on 7 August 2025 identified 15 eligible papers (average quality score: 71.4%). These articles evaluated 35 LLMs, including 21 base models. The reported performance ranges were as follows: Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, 35.9% (Generative Pre-Trained Transformer (GPT)-4) to 69.7% (Baichuan2-13B); ROUGE-2, 13.4% (Large Language Model Meta AI (Llama)) to 52.4% (Baichuan2-13B); and ROUGE-L, 16.5% (Chat General Language Model–Medical (ChatGLM-Med)) to 63.8% (finetuned Text-to-Text Transfer Transformer (T5)). The finetuned T5 consistently demonstrated high performance, based on Bidirectional Encoder Representations from Transformers Score (BERTScore): 89.2%; BiLingual Evaluation Understudy (BLEU)-1: 65.2%; BLEU-2: 57.9%; BLEU-3: 52.5%; BLEU-4: 48.3%; Metric for Evaluation of Translation with Explicit ORdering (METEOR): 38.1%; ROUGE-1: 59.9%; ROUGE-2: 50.9%; ROUGE-L: 63.8%; and subjective metrics (clinical usability: 4.5/5.0; completeness: 4.3/5.0; conciseness: 4.3/5.0; fluency: 4.4/5.0). These results, based on 132,043 computed tomography, echocardiography, magnetic resonance imaging, and X-ray reports, indicate that the finetuned T5 has strong clinical potential for assisting radiologists in impression generation, achieved through supervised finetuning rather than the prompting techniques used with closed-source LLMs.
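For readers unfamiliar with the ROUGE-N scores quoted above, the following is a minimal sketch of how n-gram overlap between a reference impression and a generated one can be computed. It assumes simple lowercased whitespace tokenization and no stemming; the reviewed studies may have used different tokenizers or library implementations, and the abstract does not state whether their ROUGE values are recall, precision, or F1. The function name `rouge_n` and the two example sentences are hypothetical and are not taken from any reviewed paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Illustrative ROUGE-N: clipped n-gram overlap between two strings.

    Assumption: lowercased whitespace tokenization, no stemming.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference (radiologist) and candidate (model) impressions.
reference = "No acute cardiopulmonary abnormality"
candidate = "No acute abnormality"

print(rouge_n(reference, candidate, n=1))
print(rouge_n(reference, candidate, n=2))
```

In this toy case the candidate recovers three of the four reference unigrams but only one of the three reference bigrams, which shows why ROUGE-2 values are typically lower than ROUGE-1 values for the same output.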
Curtise K. C. Ng
Zhonghua Sun
Ian K. H. Te
Technologies
Curtin University
Ng et al. studied this question.
synapsesocial.com/papers/6984360af1d9ada3c1fb5989 — DOI: https://doi.org/10.3390/technologies14020099