Background
Advances in artificial intelligence (AI) have markedly improved access to healthcare, giving patients immediate medical information via chatbots. Individuals with chronic cough often seek support through online resources; however, unregulated use of these tools raises concerns about misinformation, safety risks, and deviations from clinical guidelines. Critically evaluating chatbot-provided information on chronic cough is therefore crucial.

Objective
To evaluate the performance of six AI chatbots (ChatGPT-4o, ChatGPT-5, DeepSeek V3, Copilot, Gemini 2.5 Flash, and Perplexity) in responding to high-frequency chronic cough queries, with respect to accuracy, reliability, readability, and clinical guideline adherence.

Methods
Based on an inductive analysis of Google Trends and Chinese online health communities, 25 queries were formulated. Two clinical experts evaluated the responses for accuracy, supplementarity, and incompleteness, following the European Respiratory Society (ERS) chronic cough guidelines. Reliability was assessed using DISCERN, EQIP, JAMA, and GQS, while readability was measured via six standard metrics, including the Flesch–Kincaid Grade Level.

Results
Perplexity achieved the highest reliability scores among the tested models (DISCERN: 51.00 ± 3.94; EQIP: 69.40 ± 6.34), while Copilot recorded the lowest (DISCERN: 37.60 ± 4.19; EQIP: 52.40 ± 6.94; pairwise P < 0.001 vs. Perplexity). Although Copilot demonstrated comparatively better readability, no model achieved the recommended 6th-grade reading level. Pooled accuracy reached 80.39%, but critical clinical details were frequently omitted across all models.

Conclusion
While AI chatbots offer accessible health advice for chronic cough, their usefulness is constrained by significant deficiencies in readability and reliability. Widely used tools such as Copilot systematically omit guideline-based content, potentially introducing safety risks.
Future research should explore whether enhanced chatbots can safely support patient decision-making and evaluate their real-world clinical applicability.
Wu et al. (Sun,) studied this question.
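
For readers unfamiliar with the Flesch–Kincaid Grade Level named in the Methods, the following is a minimal sketch of the standard published formula, computed from raw counts. The function name and the example counts are hypothetical and are not taken from the study; how the authors counted words, sentences, and syllables is not stated in the abstract.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw text counts (standard formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Longer sentences and more syllables per word both raise the grade.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single chatbot response (not study data).
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"Estimated grade level: {grade:.2f}")
```

A result above 6 would indicate text harder than the 6th-grade level that the abstract cites as the recommended target.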