Abstract

Objectives
Medical product safety surveillance efforts, whether using electronic health record (EHR) or claims data, typically rely on structured codes. Using unstructured EHR data, particularly information extracted from clinical text through natural language processing (NLP), enriches the information available for data mining, phenotyping, and surveillance. To assess overlapping and distinct information across structured and unstructured EHR data, we mapped both to a common vocabulary (Medical Dictionary for Regulatory Activities, MedDRA). We assessed the feasibility of implementing such a mapping and explored similarities and differences at multiple levels of the concept hierarchy.

Materials and Methods
We randomly sampled 15,000 encounters (5,000 each from ambulatory, emergency, and inpatient settings). For each encounter, we extracted MedDRA concepts from clinical notes using MetaMap and mapped structured ICD-10-CM diagnoses to MedDRA. We evaluated corroboration between data sources across the MedDRA hierarchy, as well as the unique information contributed by each source.

Results
We processed 119,492 clinical notes and mapped 163,254 ICD-10-CM codes to MedDRA. Most encounters (73–98%) had some overlap between MedDRA preferred terms identified from structured and unstructured data. Among MedDRA concepts found in unstructured text, 80–95% were not found in the encounter’s associated ICD-10-CM coded data.

Discussion and Conclusion
While MedDRA concepts from structured data were mostly corroborated by those extracted from unstructured clinical text, the majority of MedDRA concepts recognized in each encounter were mentioned only in text. Leveraging MedDRA-encoded unstructured text can provide a more comprehensive clinical picture of patients and complement the structured data traditionally used in epidemiological and pharmacovigilance studies.
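
The two summary measures reported above (the share of encounters with any overlap, and the share of text-derived concepts absent from the coded data) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `EncounterConcepts` container, the function names, and the placeholder term labels are hypothetical, and the abstract does not say whether the percentages were computed per encounter or pooled across encounters (the sketch pools them). It assumes both sources have already been mapped to MedDRA preferred terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncounterConcepts:
    """Hypothetical container: MedDRA preferred terms found for one encounter."""
    encounter_id: str
    text_terms: frozenset   # terms extracted from clinical notes (e.g. via NLP)
    coded_terms: frozenset  # terms obtained by mapping ICD-10-CM codes to MedDRA


def has_overlap(enc):
    """True if at least one term appears in both sources."""
    return bool(enc.text_terms & enc.coded_terms)


def text_only_fraction(enc):
    """Share of this encounter's text-derived terms absent from the coded data."""
    if not enc.text_terms:
        return None  # undefined when no terms were extracted from text
    return len(enc.text_terms - enc.coded_terms) / len(enc.text_terms)


def summarize(encounters):
    """Pooled summary across encounters (pooling is an assumption of this sketch)."""
    n = len(encounters)
    total_text = sum(len(e.text_terms) for e in encounters)
    total_text_only = sum(len(e.text_terms - e.coded_terms) for e in encounters)
    return {
        "encounters_with_overlap": sum(has_overlap(e) for e in encounters) / n if n else None,
        "text_only_fraction": total_text_only / total_text if total_text else None,
    }


# Placeholder labels, not real MedDRA codes or study data.
toy = [
    EncounterConcepts("E1", frozenset({"PT_A", "PT_B", "PT_C", "PT_D"}), frozenset({"PT_A"})),
    EncounterConcepts("E2", frozenset({"PT_E", "PT_F"}), frozenset({"PT_G"})),
]

summary = summarize(toy)
print(summary)
```

The same set comparison could be repeated at higher levels of the MedDRA hierarchy by first replacing each preferred term with its parent concept, provided a term-to-parent lookup is available.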
Smith et al. (Wed,) studied this question.