April 22, 2026 · Open Access

Assessment of Stage Two Hypertension Treatment Plans Written by Generative AI

Key Points

The study aimed to evaluate and compare treatment plans for stage two hypertension generated by various large language models (LLMs).
Twelve different LLMs were prompted to generate hypertension treatment plans.
Six blinded reviewers scored responses on adherence to guidelines, detail/clarity, and reliability/safety.
Statistical analysis was performed using the Kruskal–Wallis test.
Perplexity received the highest composite score of 8.17 out of 9.
Dyna AI had the lowest overall score of 3.75 out of 9.
ChatGPT scored highest in adherence to guidelines with 2.75 out of 3.

Abstract

Background/Objectives: As the use of large language models (LLMs) in clinical practice, in medical education, and by patients increases, it is essential to ensure that the information they provide is accurate and safe. Our objective was to compare stage two hypertension treatment plans generated by popular LLMs. Methods: ChatGPT (GPT-4o), Claude (Claude 4 Sonnet), ClinicalKey AI, Microsoft Copilot (Wave 2), DeepSeek-V3-0324, Dyna AI, Google Gemini (2.5 Flash), Grok (version 3), Meta AI assistant (Llama 4 Maverick), OpenEvidence (version 2.0), Perplexity (Sonar backend model), and Pi (Inflection-2.5) were prompted to generate a treatment plan for stage two hypertension. Six blinded reviewers scored each response in three domains: adherence to clinical guidelines, detail/clarity, and reliability/safety. Results: Perplexity received the highest composite score (8.17 out of 9), followed by OpenEvidence (7.92 out of 9). Dyna AI had the lowest overall score (3.75 out of 9). Perplexity (3.00 out of 3), Grok (2.83 out of 3), and OpenEvidence (2.75 out of 3) had the highest scores for detail/clarity, while Dyna AI had the lowest scores for both detail/clarity (1.00 out of 3) and reliability/safety (1.00 out of 3). ChatGPT had the highest score for adherence to guidelines (2.75 out of 3), while Pi had the lowest (1.58 out of 3). The Kruskal–Wallis test showed p < 0.05 across the sub-score domains and the composite scores. Conclusions: LLMs tended to adhere to clinical guidelines and provide detailed responses but often did not provide sources or instruct users to see a healthcare professional. There was notable variability in quality, and medicine-specific LLMs were not superior to popular LLMs.
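To make the scoring and the statistical comparison concrete, the following is a minimal Python sketch, not the authors' analysis code. It assumes the composite score is the sum of the three domain scores (each out of 3, giving a maximum of 9), which the abstract implies but does not state outright. The model names and reviewer scores are made-up placeholders, not data from the study, and the helper names `composite`, `average_ranks`, and `kruskal_wallis_h` are hypothetical. The function computes only the Kruskal–Wallis H statistic with the standard tie correction; in practice a statistics library such as SciPy (`scipy.stats.kruskal`) would return both H and the p-value.

```python
from itertools import chain


def composite(adherence, detail_clarity, reliability_safety):
    """Assumed composite: the sum of the three domain scores (max 9)."""
    return adherence + detail_clarity + reliability_safety


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples, with tie correction."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over all groups.
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n), t = size of each tie group.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Placeholder composite scores from six reviewers for three hypothetical
# models. These numbers are illustrative only and are NOT the study's data.
scores = {
    "model_a": [9, 8, 8, 9, 7, 8],
    "model_b": [6, 7, 5, 6, 7, 6],
    "model_c": [4, 3, 4, 5, 3, 4],
}

h_statistic = kruskal_wallis_h(list(scores.values()))
print(f"H = {h_statistic:.3f} with {len(scores) - 1} degrees of freedom")
```

A rank-based test such as Kruskal–Wallis suits this kind of data because reviewer scores are ordinal and each model has only six ratings. Under the null hypothesis, H is compared against a chi-square distribution with (number of models minus 1) degrees of freedom.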
