What question did this study set out to answer?

To evaluate the accuracy of a retrieval-augmented generation LLM in providing prehospital care recommendations based on EMS protocols.

May 10, 2026

Can a Large Language Model Grounded in Text-Based Agency-Specific Prehospital Protocols Provide Accurate Care Recommendations?

Key Points

To evaluate the accuracy of a retrieval-augmented generation LLM in providing prehospital care recommendations based on EMS protocols.
Non-human, simulation-based experimental study using Google's NotebookLM, which generates grounded responses with a RAG-based Gemini 2.5 Flash framework.
Investigators developed six clinical scenario prompts (four adult, two pediatric) and used the relevant EMS policies and treatment protocols to define the expected patient care actions.
Two investigators independently graded the model responses for accuracy and categorized missed actions by applicability to the case and potential safety risk.
The LLM recommended 127 of 169 expected patient care actions (75% accuracy).
Of the 42 missed actions, 9 were classified as major misses, 13 as minor misses, and 20 as non-applicable to the specific case.
Five of the nine major misses occurred in the pediatric OHCA scenario, mostly because the model failed to prompt for evaluation of secondary treatable causes.

Abstract

OBJECTIVES: Large language models (LLMs) using a retrieval-augmented generation (RAG) approach have the ability to respond to user queries with answers grounded in specific sources. We conducted an exploratory evaluation of the accuracy of a RAG-based LLM to provide care recommendations for prehospital scenarios based on the emergency medical services (EMS) policies and treatment protocols (TPs).

METHODS: We conducted a non-human, simulation-based experimental study by uploading all text-based policies/TPs from a single large EMS system into Google's NotebookLM platform, which uses a RAG-based LLM (Gemini 2.5 Flash) framework to generate grounded responses. We developed six clinical scenario prompts, including adult patient scenarios (i.e., ventricular fibrillation out-of-hospital cardiac arrest [OHCA], blunt head trauma, stroke, hazardous materials exposure mass-casualty incident) and pediatric patient scenarios (i.e., pulseless electrical activity OHCA, traumatic penetrating extremity hemorrhagic shock). For each scenario, we used all relevant policies/TPs to create a specific set of expected patient care actions. We categorized actions as procedures/interventions, medications, and destination guidance. Medication grading included dose/route for all patients and weight-based dosing for pediatrics. After providing the LLM with the prompts, two investigators independently graded the LLM responses and evaluated for LLM "hallucinations". Missing actions were categorized by investigators based on applicability to the case and potential safety risk (e.g., 'non-applicable,' 'minor miss,' 'major miss'). The primary outcome was model recommendation accuracy, defined as the percentage of all actions correctly provided in the model's response. We reported descriptive statistics.

RESULTS: The LLM recommended 127 (75%) of 169 patient care actions across all cases. There were 42 missed actions. Nine of the 169 actions (5%) were categorized as 'major misses,' 13 (8%) as 'minor misses', and 20 (12%) as non-applicable to the specific case. Five of nine major misses occurred during the pediatric OHCA case; the majority of these resulted from failure to prompt for evaluation of secondary treatable causes. We identified 12 hallucinations; none were judged to endanger patient safety.

CONCLUSION: We found that a RAG-based LLM demonstrated 75% accuracy across various prehospital scenarios when providing responses grounded in the policies/TPs of a single large EMS agency.
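
The reported percentages follow directly from the action counts given in the abstract. As a minimal sketch (the variable and function names below are illustrative and not taken from the study), the primary outcome and the missed-action breakdown can be re-derived like this, with every percentage using all 169 expected actions as the denominator:

```python
# Counts as reported in the abstract; names are hypothetical, chosen for illustration.
TOTAL_ACTIONS = 169
RECOMMENDED = 127
MISSED = {"major miss": 9, "minor miss": 13, "non-applicable": 20}


def pct(count, total=TOTAL_ACTIONS):
    """Percentage of all expected actions, rounded to the nearest whole number."""
    return round(100 * count / total)


# Recommended and missed actions should account for every expected action.
missed_total = sum(MISSED.values())
assert RECOMMENDED + missed_total == TOTAL_ACTIONS

# Primary outcome: share of expected actions the model actually recommended.
accuracy = pct(RECOMMENDED)
missed_pct = {label: pct(n) for label, n in MISSED.items()}

print(f"accuracy: {accuracy}% ({RECOMMENDED}/{TOTAL_ACTIONS})")
for label, n in MISSED.items():
    print(f"{label}: {n} ({missed_pct[label]}%)")
```

Because the 20 non-applicable actions are counted in the denominator, the 75% figure treats them the same as clinically relevant misses.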
