What question did this study set out to answer?

To develop a large language model (LLM) pipeline for accurately extracting drug-event pairs from electronic health records (EHR).

June 14, 2026 | Open Access

A Large Language Model for Extracting Post-marketing Adverse Drug Events from Clinical Notes in the Electronic Health Record

Key Points

To develop a large language model (LLM) pipeline for accurately extracting drug-event pairs from electronic health records (EHR).
Developed a two-pass workflow utilizing OpenAI o1 for ADE extraction.
Sample included 372 deidentified clinical notes; medical experts reviewed each extracted ADE for validity, seriousness, and label status.
Created a manually curated gold-standard ADE set on a subset of the sample to estimate recall and accuracy.
Of 191 identified ADEs, 180 were true positives, achieving 94.2% precision and 84.1% recall (F1 = 88.9%).
Seriousness was classified correctly in 100% of cases and label status in 93.9%; MedDRA coding was correct in 92.5% of valid ADEs.
The model inferred some ADEs not mentioned by physicians; 12.2% of valid ADEs met FDA "serious" criteria, and inference cost USD 0.18 per note.
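
The reported precision and F1 can be checked from the figures above. The following is a minimal sketch, not the authors' evaluation code: the function names are illustrative, and recall is taken as reported (84.1%) because the counts behind it are not given here.

```python
def precision(true_positives: int, predicted: int) -> float:
    """Share of model-identified ADEs that reviewers judged valid."""
    return true_positives / predicted


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# 180 true positives among 191 identified ADEs (figures from the key points).
p = precision(180, 191)

# Recall is used as reported; its numerator and denominator are not stated.
r = 0.841

f1 = f1_score(p, r)
print(f"precision = {p:.1%}, recall = {r:.1%}, F1 = {f1:.1%}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (recall) than a simple average would.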

Abstract

BACKGROUND: Clinical notes contain a vast amount of potentially useful information about adverse drug event (ADE) signals that never reach pharmacovigilance databases. Traditional rule-based or sentence-level models often miss subtle causal cues and generate excess false positives.
OBJECTIVE: To build a large language model (LLM) pipeline that reads entire electronic health record (EHR) notes, identifies drug-event pairs with a "reasonable possibility" of causation, and infers other important properties of each ADE, such as whether the event is "serious" or "unlabeled".
METHODS: We adopted a two-pass workflow using the model "OpenAI o1": Pass 1 screens each note for ADEs; Pass 2 adds 20 structured fields. A diverse sample of 372 deidentified notes from physicians and pharmacists at the University of California, San Francisco (UCSF), from 31 specialty/setting cells, yielded 191 ADEs. One medical expert reviewed each ADE for validity, seriousness, and label status. Another expert created a manually curated gold-standard ADE set on 100 of the 372 notes to give us an estimate of "recall". A third expert met with the first expert to reach consensus on the validity of LLM ADEs that the first expert had validated but that were not found in the gold-standard set, giving us an estimate of "accuracy".
RESULTS: Of 191 ADEs, 180 were true positives (94.2% precision) with 84.1% recall (F1 = 88.9%). Seriousness was correct in 100% and label status in 93.9% of cases. Medical Dictionary for Regulatory Activities (MedDRA) lowest level term (LLT) coding was correct in 92.5% of valid ADEs; errors were mostly non-existent LLTs. Of all valid ADEs, 12.2% met FDA "serious" criteria, 15.0% were unlabeled, and 8.9% were "failure of efficacy." On the first pass, 84.9% of notes contained no ADEs, keeping inference costs to USD 0.18 per note and USD 0.35 per validated ADE. The model inferred some ADEs not mentioned by physicians, e.g., tacrolimus-associated hypomagnesemia.
CONCLUSIONS: Although the LLM evaluated in this study is not perfect, it can transform free-text EHR notes into ADEs with 94% accuracy; analyzed statistically in aggregate, such data can reveal new safety signals for potential drug side effects. Integrated with platforms like Sentinel in the USA or Darwin EU in the European Union, this approach could rapidly surface rare, serious, and unlabeled ADEs for further regulatory analysis.
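
The two-pass workflow described in METHODS can be pictured with a short sketch. This is an assumption-laden illustration, not the authors' pipeline: `two_pass_extract`, `toy_screen`, `toy_enrich`, and the field names are hypothetical, and the keyword-matching stand-ins only take the place of the model calls, whose prompts and 20 structured fields are not given in the abstract. The point it shows is structural: notes with no candidate ADEs after Pass 1 never reach Pass 2, which is how a high share of ADE-free notes keeps per-note cost low.

```python
from typing import Callable, Dict, List

Note = str
Ade = Dict[str, str]


def two_pass_extract(
    notes: List[Note],
    screen: Callable[[Note], List[Ade]],
    enrich: Callable[[Note, Ade], Ade],
) -> List[Ade]:
    """Run a cheap screening pass on every note, then a detailed pass
    only on notes where screening found at least one candidate ADE."""
    results: List[Ade] = []
    for note in notes:
        candidates = screen(note)  # Pass 1: screen the whole note
        if not candidates:
            continue  # no ADEs found, so Pass 2 is skipped for this note
        for ade in candidates:
            results.append(enrich(note, ade))  # Pass 2: add structured fields
    return results


# Hypothetical stand-ins for the model calls, for illustration only.
def toy_screen(note: Note) -> List[Ade]:
    if "tacrolimus" in note and "hypomagnesemia" in note:
        return [{"drug": "tacrolimus", "event": "hypomagnesemia"}]
    return []


def toy_enrich(note: Note, ade: Ade) -> Ade:
    # Placeholder values; a real second pass would infer these from the note.
    return {**ade, "serious": "unknown", "label_status": "unknown"}


notes = [
    "Routine follow-up, no medication changes.",
    "On tacrolimus; labs show hypomagnesemia.",
]
extracted = two_pass_extract(notes, toy_screen, toy_enrich)
print(extracted)
```

Passing the two stages in as callables keeps the control flow separate from any particular model or prompt, so the stand-ins could be swapped for real calls without changing the loop.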
