Large Language Models (LLMs) hold promise for medical applications but often lack domain-specific expertise. Retrieval-augmented generation (RAG) enables customization by integrating specialized knowledge. This study assessed the accuracy, consistency, and safety of LLM-RAG models in determining surgical fitness and delivering preoperative instructions using 35 local and 23 international guidelines. Ten LLMs (e.g., GPT3.5, GPT4, GPT4o, Gemini, Llama2, Llama3, and Claude) were tested across 14 clinical scenarios. A total of 3234 responses were generated and compared to 448 human-generated answers. The GPT4 LLM-RAG model with international guidelines generated answers within 20 s and achieved the highest accuracy, which was significantly better than that of human-generated responses (96.4% vs. 86.6%, p = 0.016). Additionally, the model produced no hallucinations and gave more consistent output than humans. This study underscores the potential of GPT4-based LLM-RAG models to deliver highly accurate, efficient, and consistent preoperative assessments.
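As a rough illustration of the RAG idea described above, the sketch below retrieves the guideline passage that best matches a question and prepends it to the prompt sent to a model. It is a minimal toy under stated assumptions, not the pipeline used in the study: the guideline snippets are invented placeholders, the names `GUIDELINES`, `retrieve`, and `build_prompt` are hypothetical, and simple word overlap stands in for whatever retrieval method the authors used.

```python
import re
from collections import Counter

# Hypothetical placeholder snippets, invented for illustration only;
# they are not taken from the guidelines used in the study.
GUIDELINES = [
    "Patients should stop clear fluids two hours before elective surgery.",
    "Patients on anticoagulants need a documented perioperative plan.",
    "Patients with poorly controlled diabetes should be reviewed before surgery.",
]


def tokenize(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query.

    Word overlap is an assumed stand-in for a real retriever
    (for example, embedding similarity).
    """
    query_words = tokenize(query)

    def score(doc):
        # Multiset intersection counts the words common to both texts.
        return sum((query_words & tokenize(doc)).values())

    return sorted(documents, key=score, reverse=True)[:k]


def build_prompt(query, documents, k=1):
    """Prepend the retrieved passages to the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Guideline context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    question = "When should clear fluids stop before surgery?"
    print(build_prompt(question, GUIDELINES))
```

The resulting prompt would then be passed to the chosen LLM; swapping the guideline corpus (local versus international) changes only the `documents` argument.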
Yu He Ke
Liyuan Jin
Kabilan Elangovan
npj Digital Medicine
SHILAP Revista de lepidopterología
Harvard University
Brigham and Women's Hospital
Duke-NUS Medical School
Ke et al. studied this question.
www.synapsesocial.com/papers/69dab08285037e71b26849f0 — DOI: https://doi.org/10.1038/s41746-025-01519-z