What question did this study set out to answer?

This study evaluates the quality of information leaflets generated by large language models for post-intensive care syndrome.

May 16, 2026 · Open Access

Use of Commercially Available Large Language Models to Generate Information Leaflets on Post–Intensive Care Syndrome: Clinical Utility Assessment

Key Points

Informational texts were generated using 6 different large language models and 4 prompt designs.
Two medical specialists independently reviewed the texts, and 10 evaluators (health care professionals and nonmedical personnel) rated them on a 10-point Likert scale.
Nine texts remained for final analysis after exclusions based on character count, content relevance, and expert review.
The evaluated texts averaged 6.8 or higher on every evaluation criterion, and none contained potentially harmful content.
The highest-scoring text came from LLaMA 3 70B, using a step-by-step approach combined with text-augmented prompting.
Evaluation scores varied between healthcare professionals and non-medical personnel, indicating differences in assessment perspectives.

Abstract

Background: Patients and their families without medical knowledge may find professional health care information difficult to understand. The use of large language models (LLMs) to simplify and translate complex medical content holds promise for improving comprehension while reducing the burden on health care providers tasked with delivering explanations.

Objective: This study aims to evaluate the quality of information leaflets generated using commercially available LLMs.

Methods: Informational texts on post–intensive care syndrome were generated using 6 different LLMs and 4 prompt designs with varying levels of instructional guidance. Clinical practice guideline documents were uploaded and provided to the models as reference context, reflecting a pragmatic clinical scenario without model tuning or advanced retrieval pipelines. In total, 72 texts were generated (6 models × 4 prompts × 3 outputs). After excluding texts shorter than 500 characters (n=16) and those without explicit mention of post–intensive care syndrome (n=3), 53 texts remained. To enable balanced human evaluation across model-prompt combinations, the longest eligible response from each pair was selected (4 prompts × 4 models; n=16). Following independent expert review by 2 medical specialists, 7 texts were excluded, leaving 9 texts for final analysis. Ten individuals, including health care professionals and nonmedical personnel, assessed the texts using a 10-point Likert scale across multiple quality domains. An LLM-based parallel assessment was also conducted, and scores were compared across models and evaluator groups.

Results: In the human evaluation of the selected 9 texts, the generated texts achieved an average score of 6.8 or higher across all evaluation criteria, with no potentially harmful content identified. The text generated by LLaMA 3 70B, using a step-by-step approach combined with text-augmented prompting based on clinical guidelines, received the highest overall score, whereas the lowest-rated text was produced using a simple prompt without text augmentation. Although no consistent trends were observed across LLMs or prompt engineering strategies, text-augmented prompting was generally associated with higher evaluation scores. Ratings differed between professional and nonprofessional evaluators. Given the feasibility-driven screening process and the resulting limited sample size, the findings should be interpreted as exploratory and descriptive rather than definitive estimates of overall model performance.

Conclusions: Among the selected texts included in the final human evaluation, informational materials generated using commercially available LLMs were generally rated as acceptable by human evaluators, and none contained harmful content. These findings suggest that LLMs may support the development of patient-facing informational materials under feasibility-constrained conditions, although further validation with larger and more diverse samples is warranted.
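
The automated screening steps described in the Methods (length filter, topic filter, then the longest eligible text per model-prompt pair) can be illustrated with a minimal Python sketch. This is not the authors' code: the `GeneratedText` record, the `screen` function, and the simple substring rule for detecting a mention of post–intensive care syndrome are all assumptions made for illustration. Only the 500-character threshold and the order of the steps come from the abstract, and the later expert review by 2 medical specialists is a manual step that is not modeled here.

```python
from dataclasses import dataclass

MIN_CHARS = 500  # threshold stated in the abstract: shorter texts are excluded


@dataclass
class GeneratedText:
    """Hypothetical record for one model output (names are illustrative)."""
    model: str
    prompt: str
    body: str


def mentions_topic(body: str) -> bool:
    # Assumed matching rule: case-insensitive substring match, treating the
    # en dash and the hyphen as equivalent spellings.
    normalized = body.lower().replace("\u2013", "-")
    return "post-intensive care syndrome" in normalized


def screen(texts: list[GeneratedText]) -> list[GeneratedText]:
    """Apply the length and topic filters, then keep the longest text per pair."""
    long_enough = [t for t in texts if len(t.body) >= MIN_CHARS]
    on_topic = [t for t in long_enough if mentions_topic(t.body)]

    longest: dict[tuple[str, str], GeneratedText] = {}
    for t in on_topic:
        key = (t.model, t.prompt)
        if key not in longest or len(t.body) > len(longest[key].body):
            longest[key] = t
    return list(longest.values())
```

Under these assumptions, `screen` returns at most one text per model-prompt pair, and a pair with no eligible output simply drops out of the result.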
