What question did this study set out to answer?

To assess the accuracy and safety risks of an AI decision-support system for surgical antimicrobial prophylaxis.

June 12, 2026

Pre‐implementation safety evaluation of an AI decision‐support system for surgical antimicrobial prophylaxis

Key Points

A large language model, given publicly available guideline text at inference and not fine-tuned, was evaluated across 20 simulated surgical scenarios.
Two clinicians independently assessed its recommendations for agent, dose, timing, re-dosing, and guideline citation.
Accuracy was scored across five domains, and harm was classified using a modified NCC MERP Index.
Clinically significant prescribing risk was found in 10% of scenarios (2/20): omission of required anaerobic coverage and failure to re-dose during prolonged procedures.
Overall guideline concordance was 4/5, with perfect dose accuracy; timing and guideline-citation accuracy were lower, at 70% and 45%, respectively.

Abstract

Background: Large language models (LLMs) show potential to support antimicrobial prescribing but require simulation‐based, institution‐specific safety evaluation prior to any consideration of clinical use. In Australia, antimicrobial prescribing represents a high‐risk domain for digital decision‐support systems due to patient safety and antimicrobial resistance implications.

Aim: To characterise prescribing accuracy, error phenotypes and antimicrobial stewardship risk associated with an LLM that was provided with publicly available surgical prophylaxis guidelines during inference (without fine‐tuning or model modification) across 20 simulated surgical scenarios.

Methods: Twenty simulated surgical scenarios were tested using an LLM that was prompt‐conditioned with publicly available guideline text during inference, without any fine‐tuning or modification of model weights. For each case, the model generated recommendations for agent, dose, timing, re‐dosing and guideline citation. Outputs were independently assessed by two local clinicians familiar with the guideline, with accuracy scored across five domains and harm classified using a modified National Coordinating Council for Medical Error Reporting and Prevention (NCC MERP) Index.

Results: Clinically significant antimicrobial prescribing risk was identified in 10% of simulated scenarios (2/20), recognising wide confidence intervals due to the small sample size. These included omission of required anaerobic coverage and failure to redose prophylaxis in prolonged procedures. Overall guideline concordance was 4/5, with perfect dose accuracy but lower performance for timing (70%) and guideline citation (45%).

Conclusions: This study demonstrates the feasibility of constructing institutionally governed, guideline‐based AI systems while identifying stewardship‐relevant safety risks that currently preclude clinical use without further validation.
