We would like to comment on “A Longitudinal Analysis of the Usefulness, Readability, Consistency, and Capacity of Artificial Intelligence Chatbot Responses Regarding the Reality of Chronic Pain in Children” [1]. The study's merits include its longitudinal benchmarking design and its use of 10 preset evaluation criteria. However, several methodological limitations should be recognized. First, using a single primary prompt, even when repeated 10 times per system, may not adequately reflect a model's capabilities across the wide range of queries posed by pediatric patients with chronic pain. Responses may be highly sensitive to how a prompt is worded. Second, while readability is measured quantitatively using the Flesch–Kincaid Grade Level, the “usefulness” and “empathetic tone” criteria remain subjective, and inter-evaluator consistency is not reported. Furthermore, the fact that all systems achieved a perfect 10/10 score after being given explicit scoring criteria may indicate a ceiling effect, which diminishes the ability to distinguish between models. Viewed another way, the improvements observed between 2024 and 2025 may reflect not only the “innate progress” of the technology but also systemic modifications such as alignment training and reinforcement learning based on user feedback. Providing explicit scoring criteria also acts as a form of structured prompt engineering, which shows that answer quality depends not only on the model but also on the prompt framework. The reduction in readability to below the elementary level deserves attention as well: although it reflects an ability to adapt language to children, it is important to assess whether a lower language level compromises the completeness of the content. Ethically and therapeutically, the question “Is it all in my head?” is extremely sensitive with respect to psychological stigma and the legitimization of pain.
The systems' ability to respond with greater empathy and evidence-based feedback is welcome, but the evaluation is confined to the text and does not account for the impact on actual users, such as feeling validated or experiencing reduced anxiety. As a result, linguistic quality may not correspond to the quality of psychosocial outcomes. Future research should use a variety of prompts that reflect different scenarios and should include testing in multi-turn conversations to evaluate long-term consistency and safety. Evaluation methods with higher validity and reliability should be developed, and the influence of chatbot responses on pediatric patients and their families should be investigated in simulated or real-world settings. Continuous and open benchmarking is critical in an age of rapidly evolving AI systems to enable their safe and responsible integration into pediatric practice.
Hinpetch Daungsupawong: 50% ideas, writing, analyzing, approval. Viroj Wiwanitkit: 50% ideas, supervision, approval.
The authors used a computational language-editing tool in the preparation of this article.
The authors have nothing to report.
The authors declare no conflicts of interest.
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.