BACKGROUND: Clinical notes contain a vast amount of potentially useful information about adverse drug event (ADE) signals that never reach pharmacovigilance databases. Traditional rule-based or sentence-level models often miss subtle causal cues and generate excess false positives.
OBJECTIVE: To build a large language model (LLM) pipeline that reads entire electronic health record (EHR) notes, identifies drug-event pairs with a "reasonable possibility" of causation, and infers other important properties of each ADE, such as whether the event is "serious" or "unlabeled".
METHODS: We adopted a two-pass workflow using the model "OpenAI o1": Pass 1 screens each note for ADEs; Pass 2 adds 20 structured fields. A diverse sample of 372 deidentified notes from physicians and pharmacists at the University of California, San Francisco (UCSF), drawn from 31 specialty/setting cells, yielded 191 ADEs. One medical expert reviewed each ADE for validity, seriousness, and label status. A second expert manually curated a gold-standard ADE set for 100 of the 372 notes to give us a percent "recall" estimate. A third expert met with the first expert to reach a consensus on the validity of LLM ADEs that the first expert had validated but that were not found in the gold-standard set, giving us an estimate of "accuracy".
RESULTS: Of 191 ADEs, 180 were true positives (94.2% precision) with 84.1% recall (F1 = 88.9%). Seriousness was correct in 100% and label status in 93.9% of cases. Medical Dictionary for Regulatory Activities (MedDRA) lowest level term (LLT) coding was correct in 92.5% of valid ADEs; errors were mostly non-existent LLTs. Of all valid ADEs, 12.2% met FDA "serious" criteria, 15.0% were unlabeled, and 8.9% were "failure of efficacy." On the first pass, 84.9% of notes contained no ADEs, keeping inference costs to USD 0.18 per note and USD 0.35 per validated ADE. The model inferred some ADEs not mentioned by physicians, e.g., tacrolimus-associated hypomagnesemia.
CONCLUSIONS: While the LLM evaluated in this study is not perfect, it can transform free-text EHR notes into ADEs with 94% accuracy, and such data, when statistically analyzed in aggregate, can lead to new safety signals of potential drug side effects. Integrated with platforms like Sentinel in the USA or Darwin EU in the European Union, this approach could rapidly surface rare, serious, and unlabeled ADEs for further regulatory analysis.
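As a quick arithmetic check on the reported metrics, the sketch below recomputes precision from the stated counts (180 true positives out of 191 LLM-flagged ADEs) and the F1 score from that precision and the reported 84.1% recall. The function and variable names are illustrative only and do not come from the study; the recall is taken as reported, since the abstract does not give the underlying gold-standard counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Counts stated in the RESULTS section.
true_positives = 180
flagged_ades = 191

# Precision = true positives / all ADEs flagged by the LLM.
precision = true_positives / flagged_ades

# Recall is used as reported (84.1%); its raw counts are not in the abstract.
recall = 0.841

f1 = f1_score(precision, recall)

print(f"precision = {precision:.1%}")
print(f"recall    = {recall:.1%}")
print(f"F1        = {f1:.1%}")
```

Under these inputs the recomputed values round to the figures quoted in the abstract (94.2% precision, 88.9% F1).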
Ludwig et al. studied this question.