What type of study is this?

This is a quantitative study.

October 5, 2025

Leveraging open-source large language models for clinical information extraction in resource-constrained settings.

Key points

Open-source generative language models show strong zero-shot performance on clinical information extraction tasks.
Llama-3.3-70B achieved the highest utility score (0.760) across the 28 tasks of the DRAGON benchmark.
In-context translation to English consistently reduced model performance, underscoring the need for native-language support.
Models of around 14B parameters performed well, whereas named entity recognition (NER) performance was consistently low.
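
The utility score cited above condenses performance over all benchmark tasks into one number. The sketch below assumes that the DRAGON utility score is the unweighted mean of each task's primary metric, with every metric already scaled to the 0 to 1 range. That assumption should be checked against the benchmark's own definition. The function name `utility_score` and the task names and values are hypothetical placeholders, not results from the study.

```python
from statistics import mean


def utility_score(task_scores: dict[str, float]) -> float:
    """Aggregate per-task metrics into a single benchmark score.

    Assumption (hypothetical): the utility score is the unweighted mean of
    each task's primary metric, with every metric already scaled to [0, 1].
    """
    if not task_scores:
        raise ValueError("need at least one task score")
    for name, score in task_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} outside [0, 1]: {score}")
    return mean(task_scores.values())


# Hypothetical per-task scores, for illustration only (not study results).
example = {
    "task_a_classification": 0.90,
    "task_b_regression": 0.80,
    "task_c_ner": 0.40,
}
print(round(utility_score(example), 3))  # 0.7
```

With an unweighted mean, a single weak task type such as NER pulls the overall score down just as much as any other task does.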

Summary

We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark. We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across the 28 tasks of the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English.

Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.

Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models of around 14B parameters performed well overall, with Llama-3.3-70B leading but at a high computational cost. Generative models excelled in regression tasks but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native-language support. Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.
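
The abstract attributes weak NER results to token-level output formats. The sketch below shows one plausible reason, under stated assumptions. A generative model naturally returns entity strings, and those strings must then be projected back onto per-token labels. Any spelling or spacing mismatch between the generated string and the report text causes the entity to be lost without any error. The function name `entities_to_bio`, whitespace tokenization, the BIO tag scheme, and first-match alignment are all hypothetical choices for illustration. They are not the benchmark's exact format or the llm_extractinator implementation.

```python
def entities_to_bio(text: str, entities: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Project (entity_text, label) pairs produced by an LLM onto BIO token tags.

    Hypothetical assumptions: whitespace tokenization, exact token-sequence
    matching, and only the first unlabeled occurrence of each entity is tagged.
    """
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for entity_text, label in entities:
        ent_tokens = entity_text.split()
        n = len(ent_tokens)
        if n == 0:
            continue
        # Search for the first occurrence whose tokens are all still untagged.
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if tokens[start:start + n] == ent_tokens and all(tags[i] == "O" for i in span):
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
                break
        # No exact match: the entity is dropped without raising an error.
    return list(zip(tokens, tags))


# Usage: the model wrote "12mm" although the report says "12 mm",
# so the size entity cannot be aligned and receives no tags.
report = "Lesion of 12 mm in segment 7"
predicted = [("12mm", "SIZE"), ("segment 7", "LOC")]
print([tag for _, tag in entities_to_bio(report, predicted)])
# ['O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
```

Exact matching keeps the sketch simple. A real pipeline would likely need character offsets or fuzzy matching, which brings its own trade-offs.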

Authors

Luc Builtjes

Joeran S. Bosma

Mathias Prokop

Institutions

Radboud University Nijmegen