We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark. We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English. Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance. Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models around 14B parameters performed well overall, with Llama-3.3-70B leading but at high computational cost. Generative models excelled in regression tasks, but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native language support. Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.
Luc Builtjes
Joeran S. Bosma
Mathias Prokop
Radboud University Nijmegen
Builtjes et al. studied this question.
www.synapsesocial.com/papers/68e2537cd6d66a53c24745be — DOI: https://doi.org/10.1093/jamiaopen/ooaf109