OBJECTIVES: Large language models (LLMs) that use retrieval-augmented generation (RAG) can respond to user queries with answers grounded in specific sources. We conducted an exploratory evaluation of the accuracy of a RAG-based LLM in providing care recommendations for prehospital scenarios based on emergency medical services (EMS) policies and treatment protocols (TPs).

METHODS: We conducted a non-human, simulation-based experimental study by uploading all text-based policies/TPs from a single large EMS system into Google's NotebookLM platform, which uses a RAG-based LLM (Gemini 2.5 Flash) framework to generate grounded responses. We developed six clinical scenario prompts: four adult patient scenarios (ventricular fibrillation out-of-hospital cardiac arrest [OHCA], blunt head trauma, stroke, and hazardous materials exposure mass-casualty incident) and two pediatric patient scenarios (pulseless electrical activity OHCA and traumatic penetrating extremity hemorrhagic shock). For each scenario, we used all relevant policies/TPs to create a specific set of expected patient care actions. We categorized actions as procedures/interventions, medications, or destination guidance. Medication grading included dose and route for all patients and weight-based dosing for pediatric patients. After the prompts were provided to the LLM, two investigators independently graded its responses and evaluated them for "hallucinations". Investigators categorized missed actions by applicability to the case and potential safety risk (e.g., 'non-applicable,' 'minor miss,' 'major miss'). The primary outcome was model recommendation accuracy, defined as the percentage of all expected actions correctly provided in the model's response. We reported descriptive statistics.

RESULTS: The LLM recommended 127 (75%) of 169 patient care actions across all cases and missed 42. Of the 169 actions, nine (5%) were categorized as 'major misses,' 13 (8%) as 'minor misses,' and 20 (12%) as non-applicable to the specific case. Five of the nine major misses occurred in the pediatric OHCA case; most of these resulted from failure to prompt for evaluation of secondary treatable causes. We identified 12 hallucinations; none were judged to endanger patient safety.

CONCLUSION: A RAG-based LLM demonstrated 75% accuracy across various prehospital scenarios when providing responses grounded in the policies/TPs of a single large EMS agency.
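The reported percentages follow directly from the action counts in the RESULTS. As a minimal sketch (not the authors' analysis code), the arithmetic can be checked as below; the variable names and the `pct` helper are illustrative assumptions, and percentages are assumed to be rounded to the nearest whole number.

```python
# Counts taken from the RESULTS section of the abstract.
TOTAL_ACTIONS = 169   # expected patient care actions across all six scenarios
RECOMMENDED = 127     # actions the LLM correctly provided

# Missed actions by investigator-assigned category.
MISSES = {"major miss": 9, "minor miss": 13, "non-applicable": 20}


def pct(count: int, total: int) -> int:
    """Percentage of `total`, rounded to the nearest whole number (assumed rounding rule)."""
    return round(100 * count / total)


# The missed actions are whatever the model did not recommend.
missed = TOTAL_ACTIONS - RECOMMENDED

# Sanity check: the three miss categories should account for every missed action.
assert missed == sum(MISSES.values())

# Primary outcome: share of all expected actions that the model provided.
accuracy = pct(RECOMMENDED, TOTAL_ACTIONS)

# Each miss category expressed as a share of all expected actions.
breakdown = {category: pct(n, TOTAL_ACTIONS) for category, n in MISSES.items()}

print(f"accuracy: {accuracy}% ({RECOMMENDED}/{TOTAL_ACTIONS}), missed: {missed}")
for category, share in breakdown.items():
    print(f"{category}: {MISSES[category]} ({share}%)")
```

Note that the denominator for every percentage is the full set of 169 expected actions, including the 20 later judged non-applicable, which matches the primary outcome definition given in the METHODS.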
Wang et al. (Thu,) studied this question.