What question did this study set out to answer?

This study aims to evaluate and compare different strategies for clinical text extraction from electronic health records.

March 15, 2026 | Open Access

Understanding Tradeoffs in Clinical Text Extraction: Prompting, Retrieval-Augmented Generation, and Supervised Learning on Electronic Health Records

Key Points

The study runs a controlled evaluation of prompting-based extraction, retrieval-augmented generation, and supervised fine-tuning.
The analysis uses clinical discharge summaries from the MIMIC-IV dataset.
Performance is assessed with exact matching, fuzzy lexical matching, and semantic assessment by an LLM-based judge.
Prompting achieves strong semantic correctness with minimal supervision.
Retrieval augmentation improves terminology consistency but does not expand extraction coverage.
Supervised fine-tuning provides the highest accuracy when labeled data is available.

Abstract

Clinical discharge summaries contain rich patient information but remain difficult to convert into structured representations for downstream analysis. Recent advances in large language models (LLMs) have introduced new approaches for clinical text extraction, yet their relative strengths compared with supervised methods remain unclear. This study presents a controlled evaluation of three dominant strategies for structured clinical information extraction from electronic health records: prompting-based extraction using LLMs, retrieval-augmented generation for terminology canonicalization, and supervised fine-tuning of domain-specific transformer models. Using discharge summaries from the MIMIC-IV dataset, we compare zero-shot, few-shot, and verification-based prompting across closed-source and open-source LLMs, evaluate retrieval-augmented canonicalization as a post-processing mechanism, and benchmark these methods against a fine-tuned BioClinicalBERT model. Performance is assessed using a multi-level evaluation framework that combines exact matching, fuzzy lexical matching, and semantic assessment via an LLM-based judge. The results reveal clear tradeoffs across approaches: prompting achieves strong semantic correctness with minimal supervision, retrieval augmentation improves terminology consistency without expanding extraction coverage, and supervised fine-tuning yields the highest overall accuracy when labeled data are available. Across all methods, we observe a consistent 40–50% gap between exact-match and semantic correctness, highlighting the limitations of string-based metrics for clinical natural language processing (NLP). These findings provide practical guidance for selecting extraction strategies under varying resource constraints and emphasize the importance of evaluation methodologies that reflect clinical equivalence rather than surface-form similarity.
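To make the two string-based evaluation levels concrete, here is a minimal Python sketch of exact and fuzzy lexical matching over extracted terms. It is an illustration only, not the paper's implementation: the function names, the normalization step, the use of the standard-library `difflib.SequenceMatcher` as the fuzzy matcher, and the 0.8 threshold are all assumptions, and the example terms are invented. The third level, semantic assessment by an LLM-based judge, is not sketched here.

```python
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(term.lower().split())


def exact_match(pred: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(pred) == normalize(gold)


def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    """True when the character-level similarity ratio reaches the threshold.

    The 0.8 threshold is an illustrative assumption, not a value from the paper.
    """
    ratio = SequenceMatcher(None, normalize(pred), normalize(gold)).ratio()
    return ratio >= threshold


def match_rates(predicted: list[str], gold: list[str],
                threshold: float = 0.8) -> dict[str, float]:
    """Fraction of gold terms matched by at least one predicted term."""
    if not gold:
        return {"exact": 0.0, "fuzzy": 0.0}
    exact = sum(any(exact_match(p, g) for p in predicted) for g in gold)
    fuzzy = sum(any(fuzzy_match(p, g, threshold) for p in predicted) for g in gold)
    return {"exact": exact / len(gold), "fuzzy": fuzzy / len(gold)}


# Invented example terms, for illustration only.
gold = ["myocardial infarction", "type 2 diabetes mellitus", "hypertension"]
predicted = ["heart attack", "type 2 diabetes melitus", "Hypertension"]

rates = match_rates(predicted, gold)
print(rates)
```

In this toy example, one of the three gold terms matches exactly and two match fuzzily (the misspelled "melitus" is recovered). "heart attack" is clinically equivalent to "myocardial infarction" but fails both string-based checks, which is the kind of case a semantic judge is meant to catch.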
