Abstract

Background: Patients and their families without medical knowledge may find professional health care information difficult to understand. The use of large language models (LLMs) to simplify and translate complex medical content holds promise for improving comprehension while reducing the burden on health care providers tasked with delivering explanations.

Objective: This study aims to evaluate the quality of information leaflets generated using commercially available LLMs.

Methods: Informational texts on post–intensive care syndrome were generated using 6 different LLMs and 4 prompt designs with varying levels of instructional guidance. Clinical practice guideline documents were uploaded and provided to the models as reference context, reflecting a pragmatic clinical scenario without model tuning or advanced retrieval pipelines. In total, 72 texts were generated (6 models × 4 prompts × 3 outputs). After excluding texts shorter than 500 characters (n=16) and those without explicit mention of post–intensive care syndrome (n=3), 53 texts remained. To enable balanced human evaluation across model-prompt combinations, the longest eligible response from each pair was selected (4 prompts × 4 models; n=16). Following independent expert review by 2 medical specialists, 7 texts were excluded, leaving 9 texts for final analysis. Ten individuals, including health care professionals and nonmedical personnel, assessed the texts using a 10-point Likert scale across multiple quality domains. An LLM-based parallel assessment was also conducted, and scores were compared across models and evaluator groups.

Results: In the human evaluation of the selected 9 texts, the generated texts achieved an average score of 6.8 or higher across all evaluation criteria, with no potentially harmful content identified. The text generated by LLaMA 3 70B, using a step-by-step approach combined with text-augmented prompting based on clinical guidelines, received the highest overall score, whereas the lowest-rated text was produced using a simple prompt without text augmentation. Although no consistent trends were observed across LLMs or prompt engineering strategies, text-augmented prompting was generally associated with higher evaluation scores. Ratings differed between professional and nonprofessional evaluators. Given the feasibility-driven screening process and the resulting limited sample size, the findings should be interpreted as exploratory and descriptive rather than definitive estimates of overall model performance.

Conclusions: Among the selected texts included in the final human evaluation, informational materials generated using commercially available LLMs were generally rated as acceptable by human evaluators, and none contained harmful content. These findings suggest that LLMs may support the development of patient-facing informational materials under feasibility-constrained conditions, although further validation with larger and more diverse samples is warranted.
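The automated screening steps described in the Methods (a 500-character minimum, an explicit mention of post–intensive care syndrome, and selection of the longest eligible output per model-prompt pair) can be illustrated with a minimal Python sketch. This is not the authors' code. The `GeneratedText` record, the function names, the phrase-matching rule in `PICS_TERMS`, and the tie-breaking behavior are all assumptions made for illustration, since the abstract does not specify how mentions were detected or how ties were resolved.

```python
from dataclasses import dataclass

MIN_CHARS = 500  # length threshold stated in the Methods

# Assumption: a mention is detected by a case-insensitive phrase match.
# The study's actual matching rule is not described in the abstract.
PICS_TERMS = (
    "post-intensive care syndrome",       # ASCII hyphen
    "post\u2013intensive care syndrome",  # en dash, as typeset in the abstract
)


@dataclass(frozen=True)
class GeneratedText:
    """Hypothetical record for one LLM output."""
    model: str
    prompt: str
    body: str


def is_eligible(text: GeneratedText) -> bool:
    """Keep texts of at least MIN_CHARS characters that name the syndrome."""
    lowered = text.body.lower()
    long_enough = len(text.body) >= MIN_CHARS
    mentions_pics = any(term in lowered for term in PICS_TERMS)
    return long_enough and mentions_pics


def select_longest_per_pair(texts):
    """Return the longest eligible text for each (model, prompt) pair.

    Assumption: on a tie in length, the first text encountered is kept.
    """
    best = {}
    for text in texts:
        if not is_eligible(text):
            continue
        key = (text.model, text.prompt)
        if key not in best or len(text.body) > len(best[key].body):
            best[key] = text
    return best
```

Applied to the study's counts, the two exclusion rules account for 72 - 16 - 3 = 53 eligible texts. The later expert review that reduced 16 selected texts to 9 was a human judgment step and is not modeled here.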
Hata et al. studied this question.