May 29, 2026 | Open Access

Comparative evaluation of the quality, reliability, and readability of five large language model responses to frequently asked questions on gestational hypertension

Abstract

Objective: This study aims to clarify how the choice of large language model (LLM) and the health education content category affect the quality (patient education appropriateness and overall quality) and readability of generated text, providing empirical evidence for the standardized application of LLM-assisted health communication.

Methods: Five mainstream models (Doubao, DeepSeek, Wenxin Yiyan, Gemini, and GPT-5) were used to generate 100 texts (20 per model, 20 per theme) across five health education categories: disease cognition, etiology and risk factors, diagnosis and examination, treatment and management, and prevention and prognosis. Text quality was assessed using the Chinese version of the Patient Education Material Readability Assessment Scale (C-PEMAT) and the Global Quality Scale (GQS), while readability was measured with seven metrics, including the Automated Readability Index (ARI) and the Flesch Reading Ease Score (FRES). Correlation analyses were used to explore relationships among the indicators.

Results: The five models showed a clear performance hierarchy. GPT-5 achieved the highest scores in both patient education appropriateness (C-PEMAT: 11.10 ± 2.40) and overall text quality (GQS: 5.00 [4.00, 5.00]). GPT-5 exhibited significantly higher GQS scores than all other models (χ² = 66.52, p < 0.001), while Wenxin Yiyan ranked lowest in core quality (GQS: 1.00 [1.00, 2.00]). Content categories differed in readability but were stable in quality: texts on "Prevention and Prognosis" and "Treatment and Management" yielded the highest C-PEMAT scores, whereas "Etiology and Risk Factors" texts showed weaker reading fluency. Correlation analysis confirmed that quality and readability were largely independent, though subtle associations emerged, including a weak positive link between FRES and GQS. In the factual-accuracy assessment, 19.0% of responses contained factual inaccuracies, while no response was judged to contain potentially clinically harmful misinformation. Significant between-model differences were observed in factual accuracy scores.

Conclusion: This study demonstrates a significant performance hierarchy among LLMs in generating health science texts. Individual indicators vary somewhat across health education themes, but overall quality remains stable. Notably, quality and readability are relatively independent (with only weak correlations), providing empirical evidence for understanding the use of LLMs in health popularization.
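The abstract names ARI and FRES but does not say which tool or formula variant was used to compute them. As a minimal sketch, the standard published English-language formulas for both metrics can be written as below. The helper names (`basic_counts`, `automated_readability_index`, `flesch_reading_ease`), the regex-based tokenization, and the sample sentence are illustrative assumptions, not the study's pipeline.

```python
import re


def basic_counts(text):
    """Return (characters, words, sentences) using simple regex heuristics.

    Assumption: words are runs of letters, digits, or apostrophes, and
    sentences end at '.', '!' or '?'. Real tools tokenize more carefully.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    characters = sum(len(w) for w in words)  # characters inside word tokens only
    return characters, len(words), len(sentences)


def automated_readability_index(characters, words, sentences):
    """Standard ARI formula; higher values indicate harder text."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def flesch_reading_ease(words, sentences, syllables):
    """Standard FRES formula; higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical sample text, not taken from the study.
sample = "Blood pressure is checked at every visit. Report severe headaches."
chars, n_words, n_sents = basic_counts(sample)
ari = automated_readability_index(chars, n_words, n_sents)

# Syllable counting needs a dictionary or heuristic; here the count is
# supplied by hand as an assumption rather than computed.
fres = flesch_reading_ease(n_words, n_sents, syllables=16)

print(f"ARI:  {ari:.2f}")
print(f"FRES: {fres:.2f}")
```

The two scores run in opposite directions (a higher ARI means harder text, a higher FRES means easier text), which matters when reading the sign of any correlation with GQS.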

Cite This Study

Liu et al. studied this question.

synapsesocial.com/papers/6a20d4ee6dd54ee3d3eb0c92 https://doi.org/10.3389/fpubh.2026.1833611
