Artificial intelligence (AI) chatbots are increasingly used to support diabetes self-management, yet their validity and reliability require systematic evaluation. This study evaluated and compared the validity and reliability of chatbot-generated responses to frequently asked questions about diabetes self-management. Five questions aligned with diabetes self-management parameters (knowledge/diagnosis, partnership in treatment, symptom recognition and management, and coping) were posed to 6 AI chatbots. Two experts assessed the responses using the Global Quality Score. Inter-rater reliability was analyzed using kappa statistics. Validity was evaluated using the independent-samples t test, Cronbach’s alpha, and intraclass correlation coefficients. Google Gemini showed perfect inter-rater agreement for both validity/usefulness and reliability (K=1.000, P=.002), along with test-retest reliability of α=0.929 (86.3% agreement). ChatGPT 4.0 demonstrated perfect inter-rater agreement for validity/usefulness (α=1.00, 100% agreement; K=1, P .05) but low test-retest reliability. All chatbots were generally useful and reliable in the symptom recognition and coping domains. Google Gemini provided superior information for diabetes self-management compared with the other chatbots. However, because the technology changes rapidly, continuous expert evaluation is recommended to ensure accuracy, reliability, usefulness, and ethical compliance.
Dereli et al. studied this question.
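
For readers unfamiliar with the agreement statistics named in the abstract, the following is a minimal Python sketch of how Cohen's kappa, percent agreement, and Cronbach's alpha can be computed from two sets of scores. It is not the authors' analysis code. The function names and the Global Quality Score values (1 to 5, one per question) are hypothetical and serve only as illustration; the study's actual ratings are not reproduced here.

```python
from collections import Counter
from statistics import variance


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


def cronbach_alpha(score_columns):
    """Cronbach's alpha; each column is one rater's (or one run's) scores."""
    k = len(score_columns)
    if k < 2:
        raise ValueError("need at least two score columns")
    totals = [sum(row) for row in zip(*score_columns)]
    column_variance = sum(variance(col) for col in score_columns)
    return k / (k - 1) * (1 - column_variance / variance(totals))


# Hypothetical scores for five questions (NOT data from the study).
expert_1 = [5, 4, 4, 3, 5]
expert_2 = [5, 4, 3, 3, 5]

print(f"agreement = {percent_agreement(expert_1, expert_2):.1%}")
print(f"kappa     = {cohen_kappa(expert_1, expert_2):.3f}")
print(f"alpha     = {cronbach_alpha([expert_1, expert_2]):.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be well below the percent agreement when raters use only a few score categories. The same `cronbach_alpha` helper can be applied to two runs of the same chatbot as a rough test-retest check, assuming the scores are paired by question.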