Objectives To investigate the performance of commercially available Clinical Artificial Intelligence Scribes (CAISs), assessing their accuracy, the potential clinical impact of their errors, and their documentation quality, given growing concerns around errors and safety.

Methods and analysis Seven CAIS products were investigated using eight standardised clinical consultation scenarios recorded as audio. CAIS-generated summaries were assessed against a human-validated transcript and evaluated for errors (omissions, factual inaccuracies and hallucinations). Error severity was rated by medical doctors, generating a novel severity-weighted Impact Score (linear and exponential variants) to quantify potential clinical impact. Further analysis using the Physician Documentation Quality Instrument (PDQI-10), a validated clinical note quality score, reinforced the findings.

Results Omissions dominated error counts (83.8%, p<<0.001), with CAISs varying widely in error frequency and severity, and a median of 1–6 omissions per consultation (depending on CAIS). Although less frequent, hallucinations and factual inaccuracies were more often clinically serious. No tested CAIS produced error-free summaries. The Impact Score highlighted clinical severity, notably amplifying the significance of less frequent but high-severity errors. PDQI-10 analysis indicated summaries were weakest in succinctness and organisation, but strong in consistency and clinical usefulness.

Conclusions The CAISs demonstrate high levels of summarisation accuracy. However, there is great disparity between the currently available CAIS products and, while some perform well, none is perfect. Clinicians should therefore maintain vigilance, particularly checking for omitted psychosocial details and medications, and scrutinising plausible-sounding insertions. Purchasers and regulators should be aware of the significant performance disparities identified, which reinforce the need for careful evaluation and selection of CAIS products.
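
The abstract does not state the formula for the severity-weighted Impact Score, so the following is only an illustrative sketch of how a linear and an exponential variant could differ. The 1 to 5 severity scale, the base of 2, the function names and the example ratings are all assumptions made for illustration, not the authors' method or data.

```python
from typing import Iterable


def linear_impact(severities: Iterable[int]) -> int:
    """Assumed linear variant: each error contributes its severity rating."""
    return sum(severities)


def exponential_impact(severities: Iterable[int], base: float = 2.0) -> float:
    """Assumed exponential variant: each error contributes base**severity,
    so a single high-severity error outweighs several minor ones."""
    return sum(base ** s for s in severities)


# Hypothetical ratings on an assumed 1-5 scale (not data from the study).
many_minor = [1, 1, 1, 1, 1, 1]  # six low-severity omissions
one_serious = [5]                # one high-severity hallucination

print(linear_impact(many_minor), linear_impact(one_serious))
print(exponential_impact(many_minor), exponential_impact(one_serious))
```

Under these assumed weights the linear score ranks the six minor omissions above the single serious error, while the exponential score reverses that ordering, which is the kind of amplification of rare but high-severity errors that the Results describe.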
Thomas C. Draper
Timothy M. Cox
Kathryn Lamb-Riddell
University of the West of England
Taunton & Somerset NHS Foundation Trust
NHS England
Draper et al. studied this question.
www.synapsesocial.com/papers/68dc26188a7d58c25ebb28e6 — DOI: https://doi.org/10.1136/bmjdhai-2025-000092