Background: Large-scale secondary use of clinical databases requires automated tools for retrospective extraction of structured content from free-text radiology reports.

Purpose: To share data and insights on the application of privacy-preserving open-weights large language models (LLMs) for reporting content extraction, with comparison to standard rule-based systems and the closed-weights LLMs from OpenAI.

Materials and Methods: In this retrospective exploratory study conducted between May 2024 and September 2024, zero-shot prompting of 17 open-weights LLMs was performed. These LLMs, with model weights released under open licenses, were compared with rule-based annotation and with OpenAI's GPT-4o, GPT-4o-mini, GPT-4-turbo, and GPT-3.5-turbo on a manually annotated public English chest radiography dataset (Indiana University, 3927 patients and reports). An annotated nonpublic German chest radiography dataset (18 500 reports, 16 844 patients [10 340 male]; mean age, 62.6 years ± 21.5 [SD]) was used to compare local fine-tuning of all open-weights LLMs via low-rank adaptation and 4-bit quantization to bidirectional encoder representations from transformers (BERT) with different subsets of reports (from 10 to 14 580). Nonoverlapping 95% CIs of macro-averaged F1 scores were defined as relevant differences.

Results: For the English reports, the highest zero-shot macro-averaged F1 score was observed for GPT-4o (92.4% [95% CI: 87.9, 95.9]); GPT-4o outperformed the rule-based CheXpert [Stanford University] (73.1% [95% CI: 65.1, 79.7]) but was comparable in performance to several open-weights LLMs (top three: Mistral-Large [Mistral AI], 92.6% [95% CI: 88.2, 96.0]; Llama-3.1-70b [Meta AI], 92.2% [95% CI: 87.1, 95.8]; and Llama-3.1-405b [Meta AI], 90.3% [95% CI: 84.6, 94.5]). For the German reports, Mistral-Large (91.6% [95% CI: 90.5, 92.7]) had the highest zero-shot macro-averaged F1 score among the six other open-weights LLMs and outperformed the rule-based annotation (74.8% [95% CI: 73.3, 76.1]). Using 1000 reports for fine-tuning, all LLMs (top three: Mistral-Large, 94.3% [95% CI: 93.5, 95.2]; OpenBioLLM-70b [Saama], 93.9% [95% CI: 92.9, 94.8]; and Mixtral-8×22b [Mistral AI], 93.8% [95% CI: 92.8, 94.7]) achieved significantly higher macro-averaged F1 scores than did BERT (86.7% [95% CI: 85.0, 88.3]); however, the differences were not relevant when 2000 or more reports were used for fine-tuning.

Conclusion: LLMs have the potential to outperform rule-based systems for zero-shot "out-of-the-box" structuring of report databases, with privacy-ensuring open-weights LLMs being competitive with closed-weights GPT-4o. Additionally, the open-weights LLMs outperformed BERT when moderate numbers of reports were used for fine-tuning.

Published under a CC BY 4.0 license. Supplemental material is available for this article. See also the editorial by Gee and Yao in this issue.
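The comparison criterion in the abstract (nonoverlapping 95% CIs of macro-averaged F1 scores count as relevant differences) can be illustrated with a minimal Python sketch. The abstract does not say how the intervals were computed or exactly how F1 was averaged, so everything here is an assumption made for illustration: the percentile bootstrap over reports, the per-label positive-class F1 averaged across labels, the zero-score convention for empty labels, and the function names `macro_f1`, `bootstrap_ci`, and `relevant_difference`. None of this is the authors' code.

```python
import random
from typing import List, Sequence, Tuple

Rows = Sequence[Sequence[int]]  # one row per report, one 0/1 entry per label


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class for a single label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumed convention: a label with no positives and no predictions scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Rows, y_pred: Rows) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    per_label = [
        f1_binary([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


def bootstrap_ci(y_true: Rows, y_pred: Rows, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI of macro-F1, resampling reports with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def relevant_difference(ci_a: Tuple[float, float], ci_b: Tuple[float, float]) -> bool:
    """True when the two intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


if __name__ == "__main__":
    # Intervals quoted in the abstract (English reports, zero-shot).
    gpt4o, chexpert, mistral_large = (87.9, 95.9), (65.1, 79.7), (88.2, 96.0)
    print(relevant_difference(gpt4o, chexpert))       # True: intervals are disjoint
    print(relevant_difference(gpt4o, mistral_large))  # False: intervals overlap
```

Applied to the intervals reported above, this check reproduces the abstract's reading: GPT-4o and the rule-based CheXpert labeler differ relevantly, while GPT-4o and Mistral-Large do not.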
Sebastian Nowak
Benjamin Wulff
Yannik C. Layer
Radiology
University Hospital Bonn
Nowak et al. (Wed,) studied this question.
www.synapsesocial.com/papers/69daa51200ab073a27838a51 — DOI: https://doi.org/10.1148/radiol.240895