Abstract

Rationale
Automated cohort identification and phenotyping across an institution’s data warehouse can benefit from incorporating free-text notes and reports. Radiology reports provide an expert free-text interpretation of a radiology image. Free-text reports have been challenging to use as a phenotyping tool and remain underutilized. However, with the advent of large language models (LLMs), developing automated workflows to process free text is more feasible than ever before. Our aim was to leverage LLMs to interpret radiology reports, specifically identifying the presence of pneumonia on imaging and the reasoning leading to the adjudication.

Methods
We randomly selected 230 radiology reports from patients who had a diagnosis of pneumonia on admission to the hospital. Each de-identified report was reviewed by a clinician and assigned one of three labels (‘Yes’, ‘Possible’, ‘No’) for presence of pneumonia on the source imaging. To generate the LLM instruction prompt, we used a meta-prompting approach in which an initial human-drafted instruction prompt was provided to an LLM (Grok4 Expert) to generate an improved version. The resulting instruction prompt 1 (P1) was further refined into instruction prompt 2 (P2) by providing the same LLM with 15 example text segments and reference labels. Two open-source LLMs (Llama 3:8b and MedGemma:27b) were separately configured with each instruction prompt and applied to the reports to assign one of the three pneumonia labels. Performance was calculated using sensitivity, specificity, positive predictive value (PPV), and an ordinal-aware weighted F1-score.

Results
Performance of LLMs varied significantly by prompt and diagnostic label. For ‘Yes’ cases, sensitivity ranged from 0.51 to 0.65 with consistently high specificity (0.91-0.96). Sensitivity for ‘Possible’ cases was low for Llama 3:8b prompts (0.05-0.23) but higher for MedGemma:27b prompts (0.59-0.82), while specificity remained high. For ‘No’ cases, sensitivity was highest for Llama 3:8b P2 (0.99) and MedGemma:27b P2 (0.94), with moderate specificity. Overall, MedGemma:27b P1 demonstrated the most balanced performance across all categories, achieving the highest ordinal-aware F1-score (0.584).

Conclusion
LLMs can be useful tools for interpreting radiology reports but require careful prompt engineering. Our findings argue against the use of zero-shot prompting for interpretation of radiology reports. Study limitations include ambiguity in distinguishing between key findings such as pneumonia and atelectasis. Future approaches designed to identify specific radiologic findings, rather than a broad diagnosis, may achieve greater accuracy.

This abstract is funded by: None
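
The per-label sensitivity, specificity, and PPV reported above can be illustrated with a minimal sketch. It assumes a one-vs-rest treatment of the three labels, which the abstract does not state explicitly. The function name `one_vs_rest_metrics` and the toy label lists are hypothetical and are not study data. The ordinal-aware weighted F1-score is not sketched because its weighting scheme is not defined in the abstract.

```python
# Hypothetical helper: per-label metrics under an assumed one-vs-rest scheme.
LABELS = ("Yes", "Possible", "No")


def one_vs_rest_metrics(reference, predicted, labels=LABELS):
    """Return sensitivity, specificity, and PPV for each label.

    Each label is treated in turn as the positive class and the other
    two labels are pooled as the negative class.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    results = {}
    for label in labels:
        tp = fp = fn = tn = 0
        for ref, pred in zip(reference, predicted):
            if ref == label and pred == label:
                tp += 1
            elif ref != label and pred == label:
                fp += 1
            elif ref == label and pred != label:
                fn += 1
            else:
                tn += 1
        results[label] = {
            # None marks a metric whose denominator is zero.
            "sensitivity": tp / (tp + fn) if (tp + fn) else None,
            "specificity": tn / (tn + fp) if (tn + fp) else None,
            "ppv": tp / (tp + fp) if (tp + fp) else None,
        }
    return results


if __name__ == "__main__":
    # Toy labels for illustration only, not taken from the study.
    clinician = ["Yes", "Yes", "Possible", "No", "No", "No"]
    model = ["Yes", "No", "Possible", "No", "No", "Possible"]
    for label, metrics in one_vs_rest_metrics(clinician, model).items():
        print(label, metrics)
```

Under this scheme a label can show high specificity alongside low sensitivity, as in the ‘Possible’ results for Llama 3:8b, when the model rarely assigns that label at all.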
Poluch et al. (Fri,) studied this question.