What question did this study set out to answer?

This research evaluates the effectiveness of AI-generated plain language summaries compared to human-written ones.

March 4, 2026 | Open Access

Readability and quality assessment of human versus artificial intelligence-generated plain language summaries across six large language models

Key Points

The study compared 30 human-written plain language summaries (PLS) with 180 generated by six LLMs.
Readability was assessed using Flesch Reading Ease and Flesch–Kincaid Grade Level, along with sentence length and syllables per word.
Three independent reviewers evaluated clarity, inclusiveness, interpretation, and factual accuracy.
Group differences were analyzed using one-way ANOVA with post hoc Tukey testing.
LLM-generated summaries were significantly more readable than human-written summaries across all metrics (P < 0.001).
Human-written summaries had marginally higher factual accuracy, but overall reviewer-rated quality did not differ significantly.
Gemini produced the simplest text, while Meta AI showed the best balance of readability and quality.
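
For readers unfamiliar with the readability metrics named above, the following is a minimal sketch of how they are computed. The two formulas are the standard published ones. The syllable counter is a crude vowel-group heuristic chosen for illustration, and the function names `count_syllables` and `readability` are hypothetical. The study does not state which tool it used, so scores from this sketch may differ from the reported values.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate (an assumption, not the study's method):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return sentence length, syllables per word, and both Flesch scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "words_per_sentence": words_per_sentence,
        "syllables_per_word": syllables_per_word,
        # Higher Flesch Reading Ease means easier text.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Flesch-Kincaid Grade Level approximates a US school grade.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    for name, value in readability("The cat sat on the mat.").items():
        print(f"{name}: {value:.2f}")
```

Both scores depend only on average sentence length and average syllables per word, which is why the study reports those two quantities alongside them: shorter sentences and shorter words raise Reading Ease and lower the grade level.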

Abstract

Background: Plain language summaries (PLS) aim to make scientific research understandable to non-specialists, yet producing clear and accurate summaries remains challenging, especially for non-native English writers. With advances in large language models (LLMs), automated summarisation offers a potential solution. However, few studies have directly compared human-written PLS with outputs from multiple LLMs using the same dataset.

Aims: To assess whether LLMs can support or augment human efforts to produce effective and accessible scientific communication.

Materials and Methods: In this cross-sectional study, 30 human-written PLS were compared with 180 PLS generated by six LLMs. Readability was assessed using Flesch Reading Ease, Flesch–Kincaid Grade Level, sentence length and syllables per word. Three independent reviewers evaluated clarity, inclusiveness, interpretation and factual accuracy. Group differences were analysed using one-way ANOVA with post hoc Tukey testing.

Results: LLM-generated summaries were significantly more readable than human-written summaries across all metrics (P < 0.001). Human-authored PLS showed marginally higher factual accuracy, though overall reviewer-rated quality did not differ significantly. Among the LLMs, Gemini produced the simplest text, whereas Meta AI demonstrated the best balance of readability and quality.

Conclusion: LLMs can generate PLS that are comparable in quality to human-written summaries while offering substantially improved readability. These tools may enhance accessibility for diverse audiences, though human oversight remains essential to ensure contextual accuracy and interpretive depth.
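
The group comparison described in the methods rests on the one-way ANOVA F statistic, which can be sketched as follows. This is an illustrative pure-Python version, not the authors' analysis code. The function name `one_way_anova_f` is hypothetical, and the scores in the usage example are invented placeholders, not data from the study. The post hoc Tukey step and the P value are omitted because they require the studentized range and F distributions; statistical libraries such as SciPy provide these through `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd`.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n_total > k")

    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical readability scores for three groups (placeholders only).
    scores = [
        [42.0, 45.5, 40.1, 44.3],  # e.g. human-written
        [58.2, 61.0, 57.4, 60.3],  # e.g. one LLM
        [66.7, 64.9, 68.1, 65.5],  # e.g. another LLM
    ]
    f_stat, df_b, df_w = one_way_anova_f(scores)
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F indicates that the differences between group means are large relative to the spread within groups. A significant result only says that at least one group differs, which is why a post hoc test such as Tukey's is then used to identify which pairs of groups differ.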
