What question did this study set out to answer?

This research aims to evaluate the performance of six AI chatbots in providing accurate and reliable information for chronic cough according to clinical guidelines.

April 17, 2026 | Open Access

Accuracy, reliability, readability, and European Respiratory Society guideline consistency of six generative artificial intelligence chatbots in providing health advice for chronic cough: A cross-sectional comparative assessment

Key points

The study evaluated six AI chatbots: ChatGPT-4o, ChatGPT-5, DeepSeek V3, Copilot, Gemini 2.5 Flash, and Perplexity.
The authors developed 25 high-frequency chronic cough queries from Google Trends and Chinese online health communities.
Two clinical experts assessed the chatbot responses for accuracy and adherence to European Respiratory Society guidelines.
Reliability was evaluated using DISCERN, EQIP, JAMA, and GQS; readability was assessed with the Flesch–Kincaid Grade Level and five other metrics.
Perplexity achieved the highest reliability scores, while Copilot had the lowest.
Pooled accuracy across the chatbots was 80.39%, but critical clinical details were often missing.
No chatbot reached the recommended 6th-grade reading level, indicating readability issues.

Abstract

Background: Advancements in artificial intelligence (AI) have markedly improved healthcare accessibility, providing patients with immediate medical information via chatbots. Individuals with chronic cough often seek support through online resources; however, unregulated tool use raises concerns regarding misinformation, safety risks, and clinical guideline deviations. Therefore, critically evaluating chatbot-provided information on chronic cough is crucial.

Objective: To conduct a performance evaluation of six AI chatbots (ChatGPT-4o, ChatGPT-5, DeepSeek V3, Copilot, Gemini 2.5 Flash, and Perplexity) in responding to high-frequency chronic cough queries, with respect to accuracy, reliability, readability, and clinical guideline adherence.

Methods: Based on an inductive analysis of Google Trends and Chinese online health communities, 25 queries were formulated. Two clinical experts evaluated the responses for accuracy, supplementarity, and incompleteness, following the European Respiratory Society (ERS) chronic cough guidelines. Reliability was assessed using DISCERN, EQIP, JAMA, and GQS, while readability was measured via six standard metrics, including the Flesch–Kincaid Grade Level.

Results: Perplexity achieved the highest reliability scores of the tested models (DISCERN: 51.00 ± 3.94; EQIP: 69.40 ± 6.34), while Copilot recorded the lowest (DISCERN: 37.60 ± 4.19; EQIP: 52.40 ± 6.94; pairwise P < 0.001 vs. Perplexity). Although Copilot demonstrated comparatively better readability, no model achieved the recommended 6th-grade reading level. Pooled accuracy reached 80.39%, but critical clinical details were frequently omitted across all models.

Conclusion: While AI chatbots offer accessible health advice for chronic cough, their usefulness is constrained by significant deficiencies in readability and reliability. Widely used tools such as Copilot systematically omit guideline-based content, potentially introducing safety risks. Future research should explore whether enhanced chatbots can safely support patient decision-making and evaluate their real-world clinical applicability.
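For readers unfamiliar with the Flesch–Kincaid Grade Level named in the Methods, the following is a minimal Python sketch of the standard formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The function names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration. It is not the tool the authors used, and dedicated readability software will give somewhat different scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count runs of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed to read the text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    reply = "A cough that lasts more than eight weeks is called chronic. See a doctor."
    print(round(flesch_kincaid_grade(reply), 2))
```

A score at or below 6 corresponds to the 6th-grade target mentioned in the Results; short sentences and short words lower the score.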
