What does this research mean for the field?

Large Language Models can effectively automate the extraction of structured medication information from clinical texts, with performance varying significantly between zero-shot and few-shot paradigms.

February 28, 2026 | Open Access

Feasibility of Using Large Language Models for Structured Medication Extraction from Clinical Text: A Comparative Analysis of Zero-Shot and Few-Shot Paradigms

Key Points

The study evaluates the feasibility of using large language models for structured medication extraction from unstructured clinical narratives.
Five open-weight architectures were evaluated: GPT-OSS:20B, Gemma 2:9B, Mistral 7B, Qwen3:14B, and Llama 3.2.
Performance was compared under Zero-Shot and RAG-based Few-Shot prompting paradigms.
Outpatient antimicrobial clinical notes were sourced from the CHARM registry.
GPT-OSS:20B achieved F1 scores greater than 0.90 in the Zero-Shot setting.
Gemma 2:9B reached F1 ~ 0.99 in the Few-Shot setting, outperforming larger models.
Smaller models (Mistral, Llama) hit a "hallucination barrier" that makes them unsafe for unsupervised clinical application.
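
For readers unfamiliar with how an F1 score applies to structured extraction, the following is a minimal sketch of one plausible way to score the three fields named in the abstract (Dose, Daily Frequency, Duration). The exact-match rule, the micro-averaging over fields, and the names `FIELDS` and `field_f1` are illustrative assumptions. The summary does not state how the study computed its scores.

```python
from typing import Dict, List, Optional

# Assumed field names, mirroring "Dose, Daily Frequency, Duration" in the abstract.
FIELDS = ("dose", "daily_frequency", "duration")

Record = Dict[str, Optional[str]]


def field_f1(gold: List[Record], pred: List[Record]) -> float:
    """Micro-averaged F1 over all fields, using exact string match (an assumption)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for field in FIELDS:
            g_val, p_val = g.get(field), p.get(field)
            if p_val is not None and p_val == g_val:
                tp += 1  # correct value extracted
            else:
                if p_val is not None:
                    fp += 1  # wrong value, or a value the note never stated
                if g_val is not None:
                    fn += 1  # a stated value was missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, a hallucinated value (a prediction where the note states nothing) lowers precision only, while a wrong value for a stated field lowers both precision and recall.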

Abstract

The digitization of healthcare has been accompanied by a rapid expansion of electronic health records (EHRs); however, a significant proportion of critical patient data, specifically medication regimens, remains entrapped within unstructured clinical narratives. The inability to seamlessly compute this data hinders advancements in pharmacovigilance, clinical decision support, and population health management. This study presents a comprehensive, rigorous evaluation of the feasibility of deploying Large Language Models (LLMs) to automate the extraction of structured dosage information (Dose, Daily Frequency, Duration) from outpatient antimicrobial clinical notes sourced from the Collaboration to Harmonize Antimicrobial Registry Measures (CHARM) registry. We scrutinized the performance of five distinct open-weight architectures, namely GPT-OSS:20B, Gemma 2:9B, Mistral 7B, Qwen3:14B and Llama 3.2, across both Zero-Shot and Retrieval Augmented Generation (RAG)-based Few-Shot prompting paradigms. Our analysis reveals a fundamental architectural trade-off: the reasoning-optimized GPT-OSS:20B dominates the zero-shot landscape (F1 > 0.90) by leveraging abstract schema understanding, whereas the instruction-tuned Gemma 2:9B excels in the few-shot setting (F1 ~ 0.99), effectively utilizing examples as guardrails to surpass larger models. Conversely, smaller models (Mistral, Llama) exhibit a prohibitive “hallucination barrier,” rendering them unsafe for unsupervised clinical application. Furthermore, we identify “Inconsistent Unit Handling” and “Complex Temporal Logic” as persistent failure modes that resist simple scaling laws. This report provides a definitive framework for selecting model architectures based on the availability of few-shot examples and highlights the necessity of dynamic RAG strategies to achieve production-grade reliability in medical informatics.
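
The difference between the two prompting paradigms can be illustrated with a short sketch. It assumes a JSON output schema and uses word-overlap (Jaccard) similarity as a simple stand-in for the retrieval step of a RAG-based few-shot setup. The instruction wording, the function names, and the retrieval method are all hypothetical and are not taken from the paper.

```python
import json
from typing import Dict, List, Set, Tuple

# Hypothetical instruction text; the study's actual prompts are not given here.
SCHEMA_INSTRUCTION = (
    "Extract the medication dosage from the clinical note. "
    'Reply with JSON only, using the keys "dose", "daily_frequency" and "duration". '
    "Use null for any value the note does not state."
)

Example = Tuple[str, Dict[str, object]]  # (note text, labeled extraction)


def zero_shot_prompt(note: str) -> str:
    """Zero-shot: the model sees only the schema instruction and the note."""
    return f"{SCHEMA_INSTRUCTION}\n\nNote: {note}\nJSON:"


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(note: str, pool: List[Example], k: int = 2) -> List[Example]:
    """Return the k labeled examples most similar to the note (Jaccard overlap)."""
    query = _tokens(note)

    def score(example: Example) -> float:
        candidate = _tokens(example[0])
        union = query | candidate
        return len(query & candidate) / len(union) if union else 0.0

    return sorted(pool, key=score, reverse=True)[:k]


def few_shot_prompt(note: str, pool: List[Example], k: int = 2) -> str:
    """Few-shot: retrieved examples are placed before the note as demonstrations."""
    shots = retrieve(note, pool, k)
    blocks = [f"Note: {text}\nJSON: {json.dumps(label)}" for text, label in shots]
    return "\n\n".join([SCHEMA_INSTRUCTION, *blocks, f"Note: {note}\nJSON:"])
```

In a production setting the word-overlap scorer would likely be replaced by an embedding-based retriever; the sketch only shows where the retrieved examples enter the prompt.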

Cite This Study

Schulte et al. studied this question.

synapsesocial.com/papers/69a287f20a974eb0d3c03ced
https://doi.org/10.3390/app16052300
