What question did this study set out to answer?

The study set out to assess how well different LLMs generate abstracts for dental research articles that comply with reporting guidelines.

April 26, 2026 · Open Access

Comparative evaluation of large language models for guideline-compliant abstract generation and readability in dental research: an experimental comparative study

Key Points

Evaluated three LLMs (ChatGPT, Claude, Gemini) on 75 PubMed articles using standardized prompts.
Assessed abstracts against guideline checklists (STROBE, PRISMA, CARE) and readability metrics (FRE, FKGL).
Performed statistical analyses using one-way ANOVA with Tukey post-hoc testing.
ChatGPT scored higher in quality than Claude for meta-analyses (p = 0.007).
All LLMs produced abstracts that were more readable than the original abstracts and full texts (FRE: p < 0.001; FKGL: p < 0.001).
Performance declined for systematic reviews and meta-analyses, emphasizing the need for human oversight.
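
For readers unfamiliar with the two readability metrics, the sketch below shows how Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) are computed from words per sentence and syllables per word. The coefficients are the standard published ones; the function names and the vowel-group syllable counter are illustrative assumptions, and the study does not say which tool it used to compute these scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, with a minimum of one syllable."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for a piece of English text."""
    # Naive sentence split on ., ! and ?; decimals such as "0.007"
    # would be split too, so real tools use a proper tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard Flesch Reading Ease: higher means easier to read.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Standard Flesch-Kincaid Grade Level: higher means harder to read.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl
```

Because the two scores move in opposite directions, "more readable" in the findings above corresponds to a higher FRE and a lower FKGL.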

Abstract

This study aimed to evaluate the performance of three contemporary large language models (LLMs), ChatGPT 5.0, Claude 4.5 Sonnet, and Gemini 2.5 Flash, in generating reporting-guideline-compliant abstracts of scientific articles in the dental literature. The models were assessed using guideline-based checklists (STROBE-A, PRISMA-A, CARE-A) and two readability metrics, Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). This experimental, within-items comparative study included 75 full-text PubMed articles (25 STROBE, 25 PRISMA, 25 CARE) published between 2020 and 2025. Original abstracts, author information, and metadata were removed before the full texts were provided to each LLM. All models received an identical standardized prompt, and the generated summaries were independently evaluated using 12-item checklists aligned with the corresponding reporting guidelines. Readability was assessed using FRE and FKGL. Statistical analyses were performed using one-way ANOVA with Tukey post-hoc testing. LLM performance varied significantly across study design categories. In meta-analyses, ChatGPT achieved higher quality scores than Claude (p = 0.007), while no significant differences were observed among models for case reports or original research (p > 0.05). All LLMs performed better in summarizing case reports and original research than meta-analyses (p < 0.05). Readability analysis showed that LLM-generated abstracts were consistently more readable than the original abstracts and full texts (FRE: p < 0.001; FKGL: p < 0.001), with Gemini producing the highest readability values across all study design categories. LLMs demonstrated strong summarization performance in the structurally simple study design categories (case reports and original research), while their performance declined for systematic reviews and meta-analyses. Although LLM-generated summaries substantially improved readability, guideline adherence and methodological precision remained inconsistent, underscoring the need for human oversight, particularly for complex evidence syntheses.
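
As an illustration of the one-way ANOVA step mentioned above, the sketch below computes the F statistic and degrees of freedom for three groups of checklist scores. The scores are hypothetical values invented for this example, not data from the study, and the function name is an assumption. Obtaining the p-value and running Tukey post-hoc comparisons would additionally require a statistics library, which is not shown here.

```python
from statistics import mean


def one_way_anova(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical 12-item checklist scores for three models (illustration only).
hypothetical_scores = {
    "model_a": [10.0, 11.0, 9.0, 10.0],
    "model_b": [8.0, 9.0, 7.0, 8.0],
    "model_c": [9.0, 10.0, 9.0, 8.0],
}
f_stat, df_b, df_w = one_way_anova(list(hypothetical_scores.values()))
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one model's mean score differs; the post-hoc test then identifies which pairs differ.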
