Abstract Here we conducted a retrospective evaluation of an electronic medical record-embedded large language model clinical decision support system deployed across 16 primary care clinics in Kenya, between July and September 2024. A panel of trained physicians reviewed 1,469 records. Hallucinations were uncommon, occurring in 50 encounters (3.4%, 95% confidence interval (CI) 2.5–4.5), and most often involved misexpanded acronyms or drug names. Clinical management guidance aligned with local guidelines in almost all cases (1,455; 99%, 95% CI 98.4–99.5). Despite this, clinicians did not modify documentation in 917 encounters (62%, 95% CI 59.9–64.9). Safety assessments identified actively harmful recommendations from the large language model in 115 encounters (7.8%, 95% CI 6.5–9.3), with 67 such recommendations appearing in the final documentation. Conversely, risk present in the clinician’s initial notes was fully mitigated in 118 encounters (8.0%, 95% CI 6.7–9.5 overall; 12.1%, 95% CI 9.5–15.2 of amended cases). Overall, the tool showed strong potential to support quality improvement, but the asymmetric adoption of harmful versus beneficial outputs underscores the need for usability optimization, local guardrails and prospective trials to confirm patient-level benefit.
Agweyu et al. (Tue,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: