What question did this study set out to answer?

The aim is to enhance the accuracy of suicide risk models by improving the detection of EHR reported death dates.

January 23, 2026 | Open Access

Developing a Natural Language Processing Strategy to Avoid Biased Data in Electronic Health Record Suicide Risk Modeling

Key Points

The aim is to enhance the accuracy of suicide risk models by improving the detection of EHR reported death dates.
Selected 1620 Veterans Affairs patients (2017–2018) who died by suicide and had EHR data in the 5 days before the reported death date.
Extracted and analyzed 9127 interval EHR texts from the 5 days before reported death.
Developed code to identify texts as pre- or post-death based on their content.
Identified 1742 texts entered on the reported death date.
Found 274 texts entered after the death date.
Retained 60.9% of interval data, improving the detection of valid EHR entries before death.

Abstract

Objective: Unstructured electronic health record (EHR) data is increasingly used to enhance suicide risk modeling. Unfortunately, EHR-reported death dates are frequently inaccurate. Including EHR data from after patients' deaths, or after the suicidal actions that led to their deaths, potentially biases suicide prediction models. In contrast to prior methods, which withheld all data from the 5 days before the reported death, this study investigates using natural language processing to improve the accuracy of detecting EHR-reported death dates.

Methods: We selected all Veterans Affairs patients who died by suicide and had EHR data during the 5 days before the reported death date (n = 1620) during 2017–2018, and extracted all interval EHR texts (texts = 9127). We randomly selected a subset of the corpus to develop code that identifies whether texts were written before or after death or suicidal action, then applied this approach to the full corpus.

Results: In the full corpus, we identified 1742 texts entered on the reported death date, 274 texts entered after the death date, and 1556 texts that did not reference death or suicidal action but were entered chronologically after other texts indicating death. In contrast to the prior method, which excluded all interval texts, our derived approach retained 60.9% of interval data.

Conclusions: Our approach improved detection of valid EHR data in the interval before patient death.

Relevance to clinical practice: This study operationalizes a method to detect immediate pre-mortem EHR data that could contribute to less bias in suicide risk modeling. Using such data can improve risk prediction and in turn bolster prevention services.
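The Methods and Results describe two ideas: flagging texts whose content indicates death or suicidal action, and also flagging later texts for the same patient even when they do not mention death. The following is a minimal sketch of that logic, not the study's actual code. The cue phrases in `POST_EVENT_PATTERN`, the `Note` record, and the function `flag_post_event` are all hypothetical names and rules introduced here for illustration; the abstract does not publish the rules the authors used.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Hypothetical cue phrases (an assumption for this sketch). A real rule set
# would need clinical review: "died" also matches "patient's father died".
POST_EVENT_PATTERN = re.compile(
    r"\b(deceased|passed away|pronounced dead|time of death|died)\b",
    re.IGNORECASE,
)


@dataclass
class Note:
    """Hypothetical record for one EHR text entry."""
    patient_id: str
    entered_at: datetime
    text: str


def flag_post_event(notes):
    """Return a list of booleans parallel to `notes`.

    True means the note should be excluded, either because its text
    matches a cue phrase or because it was entered after an earlier
    note for the same patient that matched one.
    """
    # Visit notes grouped by patient, in chronological order.
    order = sorted(
        range(len(notes)),
        key=lambda i: (notes[i].patient_id, notes[i].entered_at),
    )
    flags = [False] * len(notes)
    patients_with_event = set()
    for i in order:
        note = notes[i]
        mentions_event = POST_EVENT_PATTERN.search(note.text) is not None
        if mentions_event or note.patient_id in patients_with_event:
            flags[i] = True
            patients_with_event.add(note.patient_id)
    return flags


notes = [
    Note("A", datetime(2018, 3, 3, 9, 0), "Appointment cancelled."),
    Note("A", datetime(2018, 3, 1, 10, 0), "Seen for medication refill."),
    Note("B", datetime(2018, 5, 2, 8, 0), "No acute concerns."),
    Note("A", datetime(2018, 3, 2, 14, 0), "Family called: patient passed away."),
]
flags = flag_post_event(notes)
```

In this toy input, the cancelled-appointment note for patient A is flagged only because it follows the note reporting the death, which mirrors the third group in the Results. As a consistency check on the reported figures: if the three flagged groups are disjoint and all were excluded (the abstract does not state this explicitly), then (9127 − 1742 − 274 − 1556) / 9127 = 5555 / 9127 ≈ 60.9%, which matches the reported retention.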
