May 1, 2025 | Open Access

Evaluating artificial intelligence chatbots’ responses to gynecomastia inquiries: Comparative study of information quality, readability, and guideline consistency

Xinran Shao (Liaoning Provincial People's Hospital), Ting Ruan (Fu Jen Catholic University), Xingai Ju (Liaoning Provincial People's Hospital)

Key Points

AI chatbots may offer immediate support for gynecomastia, but variability in quality raises concerns.
Copilot had the lowest DISCERN score, while DeepSeek scored highest on EQIP, indicating gaps in quality between chatbots.
ChatGPT had the highest FKGL score and the lowest FKRE score, indicating the poorest readability.
Overall guideline consistency for AI responses was 85.71%, but key details were frequently omitted.

Abstract

Background: With the rapid development of artificial intelligence (AI) technologies, AI chatbots have been widely applied in healthcare to provide patients with immediate information. Many people feel embarrassed to discuss gynecomastia in person and turn to online resources for support.

Objective: This study evaluates the performance of five popular AI chatbots (ChatGPT, DeepSeek, Gemini, Perplexity, and Copilot) in answering questions about gynecomastia, focusing on their reliability, quality, readability, and guideline consistency.

Methods: The top 25 gynecomastia-related queries searched globally from 2004 to 2025 were retrieved from Google Trends and input into the five AI chatbots. The reliability and quality of the responses were assessed using the DISCERN questionnaire and the Ensuring Quality Information for Patients (EQIP) tool. Readability was analyzed via the Flesch-Kincaid Grade Level (FKGL) and the Flesch-Kincaid Reading Ease Score (FKRE). Responses were compared with the European Association of Andrology guidelines and rated for accuracy, supplementary content, and incompleteness.

Results: Copilot had the lowest DISCERN score (median [interquartile range, IQR]: 41.5 [36.0-45.0]), while DeepSeek performed best in EQIP scoring (median [IQR]: 60.4 [59.0-64.1]). For readability, ChatGPT exhibited the highest FKGL score (mean ± standard deviation (SD): 15.1 ± 2.0) and the lowest FKRE score (mean ± SD: 15.1 ± 2.0), indicating the poorest readability. In contrast, DeepSeek achieved the lowest FKGL (mean ± SD: 11.0 ± 1.2), suggesting superior readability. Guideline consistency analysis revealed an overall accuracy of 85.71% for AI responses, but key details were often omitted.

Conclusion: AI chatbots provide immediate informational support for gynecomastia patients, but their readability and reliability vary significantly, and they risk omitting guideline content.
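
The two readability measures named in the Methods can be illustrated with the standard published Flesch formulas. The sketch below is only an illustration: the abstract does not say which tool the authors used, so the function names, the tokenization, and the vowel-group syllable counter are all assumptions and will not reproduce the reported scores exactly. A higher FKGL and a lower reading ease score both indicate harder text.

```python
import re


def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts (standard formula)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the authors' method):
    count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, reading ease) for a chatbot response given as plain text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        fkgl(len(words), len(sentences), syllables),
        reading_ease(len(words), len(sentences), syllables),
    )
```

Calling `readability(response_text)` on each chatbot answer and averaging the results per chatbot would give summary values of the kind reported in the Results, subject to the counting heuristics above.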
