Background/Objectives: As the use of large language models (LLMs) in clinical practice, in medical education, and by patients increases, it is essential to ensure that the information they provide is accurate and safe. Our objective was to compare stage two hypertension treatment plans generated by popular LLMs. Methods: ChatGPT (GPT-4o), Claude (Claude 4 Sonnet), ClinicalKey AI, Microsoft Copilot (Wave 2), DeepSeek-V3-0324, Dyna AI, Google Gemini (2.5 Flash), Grok (version 3), Meta AI assistant (Llama 4 Maverick), OpenEvidence (version 2.0), Perplexity (Sonar backend model), and Pi (Inflection-2.5) were each prompted to generate a treatment plan for stage two hypertension. Six blinded reviewers scored each response in three domains: adherence to clinical guidelines, detail/clarity, and reliability/safety. Results: Perplexity received the highest composite score (8.17 out of 9), followed by OpenEvidence (7.92 out of 9). Dyna AI had the lowest composite score (3.75 out of 9). Perplexity (3.00 out of 3), Grok (2.83 out of 3), and OpenEvidence (2.75 out of 3) had the highest scores for detail/clarity, while Dyna AI had the lowest scores for both detail/clarity (1.00 out of 3) and reliability/safety (1.00 out of 3). ChatGPT had the highest score for adherence to guidelines (2.75 out of 3), while Pi had the lowest (1.58 out of 3). The Kruskal–Wallis test showed significant differences among the LLMs (p < 0.05) in every sub-score domain and in the composite score. Conclusions: LLMs tended to adhere to clinical guidelines and provide detailed responses, but often did not provide sources or instruct users to see a healthcare professional. There was notable variability in quality, and medicine-specific LLMs were not superior to popular LLMs.
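To make the scoring arithmetic and the statistical comparison concrete, the following is a minimal Python sketch, not the authors' analysis code. It assumes that each reviewer rates each domain on a scale up to 3 and that the composite score is the mean, across reviewers, of the three summed domain scores; the abstract reports only the resulting values, so this is an inference. The function names and all numbers in the example are hypothetical and are not study data. The Kruskal–Wallis H statistic is computed with the standard correction for ties, which matters when ratings take only a few distinct values.

```python
from statistics import mean


def composite_score(reviews):
    """Mean over reviewers of the summed domain scores.

    `reviews` is a list of (guidelines, detail, safety) tuples, one per
    reviewer. Assumption: each domain is scored up to 3, so the sum is up to 9.
    """
    return mean(sum(domains) for domains in reviews)


def rank_with_ties(values):
    """Return 1-based average ranks and the size of each group of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last position holding the same value as position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of score lists."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks, tie_sizes = rank_with_ties(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    # Hypothetical ratings from two reviewers for one model (not study data).
    print(composite_score([(3, 3, 2), (3, 2.5, 3)]))
    # Hypothetical per-reviewer composite scores for three models.
    print(kruskal_wallis_h([[8, 8.5, 9], [6, 7, 7.5], [3, 4, 4.5]]))
```

Under the null hypothesis, H is approximately chi-square distributed with k − 1 degrees of freedom for k groups, so a p-value can be obtained by comparing H against that distribution. SciPy's `scipy.stats.kruskal` provides the statistic and p-value directly for those who prefer a library routine.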
Metzger et al. studied this question.