What question did this study set out to answer?

This systematic review aims to evaluate the performance of large language models in generating impressions from radiology reports.

February 5, 2026 · Open Access

Performance of Large Language Models for Radiology Report Impression Generation: A Systematic Review

Key Points

This systematic review aims to evaluate the performance of large language models in generating impressions from radiology reports.
Seven electronic databases were searched on 7 August 2025 to identify relevant literature.
Fifteen eligible papers were included, together assessing 35 large language models (21 of them base models).
Performance was analyzed with automatic metrics (ROUGE, BLEU, METEOR, BERTScore) and subjective ratings such as clinical usability.
Finetuned T5 performed consistently well across multiple evaluation metrics.
ROUGE-1 scores ranged from 35.9% (GPT-4) to 69.7% (Baichuan2-13B); finetuned T5 scored 59.9% on ROUGE-1 and 63.8% on ROUGE-L, the top of the reported ROUGE-L range.
For finetuned T5, subjective ratings were all above 4.0 out of 5.0: clinical usability 4.5, completeness 4.3, conciseness 4.3, and fluency 4.4.

Abstract

No systematic review has previously examined the application of large language models (LLMs) for generating impressions from radiology report findings. This study systematically reviews the performance of LLMs on this task and their associated evaluation methodologies. A search of seven electronic databases on 7 August 2025 identified 15 eligible papers (average quality score: 71.4%). These articles evaluated 35 LLMs, including 21 base models. The reported performance ranges were as follows: Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1, 35.9% (Generative Pre-Trained Transformer (GPT)-4) to 69.7% (Baichuan2-13B); ROUGE-2, 13.4% (Large Language Model Meta AI (Llama)) to 52.4% (Baichuan2-13B); and ROUGE-L, 16.5% (Chat General Language Model–Medical (ChatGLM-Med)) to 63.8% (finetuned Text-to-Text Transfer Transformer (T5)). The finetuned T5 consistently demonstrated high performance, based on Bidirectional Encoder Representations from Transformers Score (BERTScore): 89.2%; BiLingual Evaluation Understudy (BLEU)-1: 65.2%; BLEU-2: 57.9%; BLEU-3: 52.5%; BLEU-4: 48.3%; Metric for Evaluation of Translation with Explicit ORdering (METEOR): 38.1%; ROUGE-1: 59.9%; ROUGE-2: 50.9%; ROUGE-L: 63.8%; and subjective metrics (clinical usability: 4.5/5.0; completeness: 4.3/5.0; conciseness: 4.3/5.0; fluency: 4.4/5.0). These results, based on 132,043 computed tomography, echocardiography, magnetic resonance imaging, and X-ray reports, indicate its strong clinical potential for assisting radiologists in impression generation through supervised finetuning rather than prompting techniques used in closed-source LLMs.
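For readers unfamiliar with the ROUGE family of metrics quoted above, the following is a minimal sketch of how ROUGE-N and ROUGE-L can be computed for one generated impression against a reference impression. It is an illustration only: the function names, the whitespace tokenization, and the example sentences are assumptions made here, not taken from the reviewed studies, and the abstract does not state whether the papers reported recall, precision, or F1, so all three are returned.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, ref_total, cand_total):
    """Turn an overlap count into precision, recall, and F1."""
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_n(reference, candidate, n=1):
    """ROUGE-N: clipped n-gram overlap between reference and candidate.

    Assumption: lowercase, whitespace tokenization (real toolkits may
    also stem words or strip punctuation).
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter '&' keeps the minimum count
    return prf(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L: overlap measured by the longest common subsequence."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return prf(lcs_length(ref, cand), len(ref), len(cand))


# Hypothetical impressions, invented for illustration.
reference = "no acute cardiopulmonary abnormality"
candidate = "no acute abnormality"

r1 = rouge_n(reference, candidate, n=1)
r2 = rouge_n(reference, candidate, n=2)
rl = rouge_l(reference, candidate)
print(r1, r2, rl)
```

In this toy pair the candidate drops one word, so ROUGE-1 recall is 3/4 while ROUGE-2 recall falls to 1/3, which shows why ROUGE-2 figures in the abstract sit well below the ROUGE-1 figures for the same models.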

Authors

Curtise K. C. Ng

Zhonghua Sun

Ian K. H. Te

Journals

Technologies

Institutions

Curtin University
