What question did this study set out to answer?

May 20, 2026

C56-37 Using Large Language Models to Interpret Radiology Reports for Identification of Pneumonia on Imaging

Key Points

The aim was to use large language models (LLMs) to interpret radiology reports and identify pneumonia on imaging.
The authors randomly selected 230 radiology reports from patients with an admission diagnosis of pneumonia; a clinician labeled each report 'Yes', 'Possible', or 'No'.
Instruction prompts were developed with a meta-prompting approach using Grok4 Expert, then applied with two open-source LLMs (Llama 3:8b and MedGemma:27b).
Performance was measured with sensitivity, specificity, positive predictive value, and an ordinal-aware weighted F1-score.
For 'Yes' cases, sensitivity ranged from 0.51 to 0.65, and specificity was high (0.91-0.96).
For 'Possible' cases, sensitivity was low for Llama 3:8b (0.05-0.23) but higher for MedGemma:27b (0.59-0.82).
Overall, MedGemma:27b with prompt P1 achieved the highest ordinal-aware F1 score (0.584).
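The per-label sensitivity, specificity, and positive predictive value listed above are typically computed one-vs-rest: each label in turn is treated as the positive class and the other two as negative. The sketch below illustrates that calculation under this assumption. The function name and the toy labels are hypothetical and are not the study's code or data, and the ordinal-aware weighted F1-score is not reproduced because the abstract does not define its weighting.

```python
from typing import Dict, Sequence

LABELS = ("Yes", "Possible", "No")


def one_vs_rest_metrics(
    truth: Sequence[str], predicted: Sequence[str], label: str
) -> Dict[str, float]:
    """Sensitivity, specificity, and PPV for one label against all others."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    pairs = list(zip(truth, predicted))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Toy labels for illustration only; these are not the study's reports.
clinician = ["Yes", "Yes", "Possible", "No", "No", "No"]
model = ["Yes", "Possible", "Possible", "No", "No", "Yes"]

for name in LABELS:
    print(name, one_vs_rest_metrics(clinician, model, name))
```

Reporting each label separately, as the study does, makes it visible when a model handles clear-cut reports well but rarely assigns the intermediate 'Possible' label.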

Abstract

Rationale: Automated cohort identification and phenotyping across an institution’s data warehouse can benefit from incorporating free-text notes and reports. Radiology reports provide an expert free-text interpretation of a radiology image. The use of free-text reports as a phenotyping tool has been challenging and is underutilized. However, with the advent of large language models (LLMs), developing automated workflows to process free text is more feasible than ever before. Our aim was to leverage LLMs to interpret radiology reports, specifically identifying the presence of pneumonia on imaging and the reasoning leading to the adjudication.

Methods: We randomly selected 230 radiology reports from patients who had a diagnosis of pneumonia on admission to the hospital. Each de-identified report was reviewed by a clinician and assigned one of three labels (‘Yes’, ‘Possible’, ‘No’) for presence of pneumonia on the source imaging. To generate the LLM instruction prompt, we used a meta-prompting approach in which an initial human-drafted instruction prompt was provided to an LLM (Grok4 Expert) to generate an improved version. The resulting instruction prompt (P1) was further refined into a second instruction prompt (P2) by providing the same LLM with 15 example text segments and reference labels. Two open-source LLMs (Llama 3:8b and MedGemma:27b) were separately configured with each instruction prompt and applied to the reports to assign one of the three pneumonia labels. Performance was calculated using sensitivity, specificity, positive predictive value (PPV), and an ordinal-aware weighted F1-score.

Results: Performance of LLMs varied significantly by prompt and diagnostic label. For ‘Yes’ cases, sensitivity ranged from 0.51-0.65 with consistently high specificity (0.91-0.96). Sensitivity for ‘Possible’ cases was low for Llama 3:8b prompts (0.05-0.23) but higher for MedGemma:27b prompts (0.59-0.82), while specificity remained high. For ‘No’ cases, sensitivity was highest for Llama 3:8b P2 (0.99) and MedGemma:27b P2 (0.94), with moderate specificity. Overall, MedGemma:27b P1 demonstrated the most balanced performance across all categories, achieving the highest ordinal-aware F1 score (0.584).

Conclusion: LLMs can be useful tools for interpreting radiology reports but require careful prompt engineering. Our findings argue against the use of zero-shot prompting for interpretation of radiology reports. Study limitations include ambiguity in distinguishing between key findings like pneumonia and atelectasis. Future approaches designed to identify specific radiologic findings, rather than a broad diagnosis, may achieve greater accuracy.

This abstract is funded by: None
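The two-stage meta-prompting procedure in the Methods (a human draft refined into P1, then P1 refined into P2 with labeled examples) could be organized roughly as follows. This is a minimal sketch, not the authors' implementation: the `llm` argument is a hypothetical callable standing in for whatever interface reaches the prompt-writing model, the request wording is invented for illustration, and the example segment in the usage is made up rather than taken from the study.

```python
from typing import Callable, Sequence, Tuple

LABELS = ("Yes", "Possible", "No")


def build_p1_request(draft_prompt: str) -> str:
    """Ask the prompt-writing LLM to improve a human-drafted prompt."""
    return (
        "Rewrite the instruction prompt below so that a model reading a "
        "radiology report answers with exactly one of: "
        + ", ".join(LABELS)
        + ".\n\nDRAFT PROMPT:\n"
        + draft_prompt
    )


def build_p2_request(p1: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Ask the same LLM to refine P1 using (text segment, label) examples."""
    shown = "\n".join(
        f"- Text: {text}\n  Label: {label}" for text, label in examples
    )
    return (
        "Refine the instruction prompt below using these labeled examples."
        "\n\nPROMPT:\n" + p1 + "\n\nEXAMPLES:\n" + shown
    )


def refine_prompts(
    llm: Callable[[str], str],
    draft_prompt: str,
    examples: Sequence[Tuple[str, str]],
) -> Tuple[str, str]:
    """Return (P1, P2): the improved prompt and its example-refined version."""
    p1 = llm(build_p1_request(draft_prompt))
    p2 = llm(build_p2_request(p1, examples))
    return p1, p2


if __name__ == "__main__":
    # Stand-in for a real model call; it only echoes the request size.
    def echo_llm(request: str) -> str:
        return f"[refined prompt based on {len(request)} characters]"

    made_up_examples = [("Right lower lobe consolidation.", "Yes")]
    p1, p2 = refine_prompts(echo_llm, "Classify the report.", made_up_examples)
    print(p1)
    print(p2)
```

Keeping P1 and P2 as separate outputs mirrors the study design, in which each downstream model was configured with each prompt and evaluated separately.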

Cite This Study

Poluch et al. studied this question.

synapsesocial.com/papers/6a0d5098f03e14405aa9c863 https://doi.org/10.1093/ajrccm/aamag162.1061
