Clinical discharge summaries contain rich patient information but remain difficult to convert into structured representations for downstream analysis. Recent advances in large language models (LLMs) have introduced new approaches for clinical text extraction, yet their relative strengths compared with supervised methods remain unclear. This study presents a controlled evaluation of three dominant strategies for structured clinical information extraction from electronic health records: prompting-based extraction using LLMs, retrieval-augmented generation for terminology canonicalization, and supervised fine-tuning of domain-specific transformer models. Using discharge summaries from the MIMIC-IV dataset, we compare zero-shot, few-shot, and verification-based prompting across closed-source and open-source LLMs, evaluate retrieval-augmented canonicalization as a post-processing mechanism, and benchmark these methods against a fine-tuned BioClinicalBERT model. Performance is assessed using a multi-level evaluation framework that combines exact matching, fuzzy lexical matching, and semantic assessment via an LLM-based judge. The results reveal clear tradeoffs across approaches: prompting achieves strong semantic correctness with minimal supervision, retrieval augmentation improves terminology consistency without expanding extraction coverage, and supervised fine-tuning yields the highest overall accuracy when labeled data are available. Across all methods, we observe a consistent 40−50% gap between exact-match and semantic correctness, highlighting the limitations of string-based metrics for clinical Natural Language Processing (NLP). These findings provide practical guidance for selecting extraction strategies under varying resource constraints and emphasize the importance of evaluation methodologies that reflect clinical equivalence rather than surface-form similarity.
Yadav et al. (Fri,) studied this question.