January 1, 2025 · Open Access

Privacy-ensuring Open-weights Large Language Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports

Abstract

Background: Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports.

Purpose: To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for reporting content extraction, with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI.

Materials and Methods: In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was performed. These LLMs, with model weights released under open licenses, were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University, 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports, 16 844 patients [10 340 male]; mean age, 62.6 years ± 21.5 [SD]) was used to compare local fine-tuning of all open-weights LLMs via low-rank adaptation and 4-bit quantization to bidirectional encoder representations from transformers (BERT) with different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences.

Results: For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert [Stanford University] (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI], 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) had the highest zero-shot macro-averaged F1 score among the six other open-weights LLMs and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). Using 1000 reports for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama], 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI], 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 scores than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning.

Conclusion: LLMs have the potential to outperform rule-based systems for zero-shot "out-of-the-box" structuring of report databases, with privacy-ensuring open-weights LLMs being competitive with closed-weights GPT-4o. Additionally, the open-weights LLMs outperformed BERT when moderate numbers of reports were used for fine-tuning.

Published under a CC BY 4.0 license. Supplemental material is available for this article. See also the editorial by Gee and Yao in this issue.
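The comparison rule stated in the methods (nonoverlapping 95% CIs of macro-averaged F1 scores count as relevant differences) can be illustrated with a minimal Python sketch. The abstract does not say how the CIs were computed; the percentile bootstrap over reports used here is an assumption, as are all function names, the label layout (one 0/1 value per finding per report), and the convention of scoring a finding with no positive labels or predictions as 0.

```python
import random


def f1_score(true, pred):
    """F1 for one binary finding, given parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(true, pred) if t and p)
    fp = sum(1 for t, p in zip(true, pred) if not t and p)
    fn = sum(1 for t, p in zip(true, pred) if t and not p)
    denom = 2 * tp + fp + fn
    # Assumed convention: a finding with no positives at all scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-finding F1 scores.

    y_true, y_pred: one tuple per report, holding one 0/1 value per finding.
    """
    n_findings = len(y_true[0])
    scores = [
        f1_score([r[k] for r in y_true], [r[k] for r in y_pred])
        for k in range(n_findings)
    ]
    return sum(scores) / n_findings


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling reports with replacement.

    This is an assumed method, not necessarily the one used in the study.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_relevant_difference(ci_a, ci_b):
    """The abstract's criterion: the two 95% CIs do not overlap."""
    return not intervals_overlap(ci_a, ci_b)
```

Applied to the intervals reported above for the English reports, `is_relevant_difference((87.9, 95.9), (65.1, 79.7))` (GPT-4o versus CheXpert) is true, whereas `is_relevant_difference((87.9, 95.9), (88.2, 96.0))` (GPT-4o versus Mistral-Large) is false, matching the abstract's "outperformed" and "comparable" wording.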
