This study aimed to evaluate the performance of three contemporary large language models (LLMs), ChatGPT 5.0, Claude 4.5 Sonnet, and Gemini 2.5 Flash, in generating reporting-compliant abstracts of scientific articles in the dental literature. The models were assessed using guideline-based checklists (STROBE-A, PRISMA-A, CARE-A) and two readability metrics: Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL). This experimental, within-items comparative study included 75 full-text PubMed articles (25 STROBE, 25 PRISMA, 25 CARE) published between 2020 and 2025. Original abstracts, author information, and metadata were removed before the full texts were provided to each LLM. All models received an identical standardized prompt, and the generated summaries were independently evaluated using 12-item checklists aligned with the corresponding reporting guidelines. Readability was assessed using FRE and FKGL. Statistical analyses were performed using one-way ANOVA with Tukey post-hoc testing. LLM performance varied significantly across study design categories. In meta-analyses, ChatGPT achieved higher quality scores than Claude (p = 0.007), while no significant differences were observed among models for case reports or original research (p > 0.05). All LLMs performed better in summarizing case reports and original research than meta-analyses (p < 0.05). Readability analysis showed that LLM-generated abstracts were consistently more readable than original abstracts and full texts (FRE: p < 0.001; FKGL: p < 0.001), with Gemini producing the highest readability values across all study design categories. LLMs demonstrated strong summarization performance in structurally simple study design categories (case reports and original research), while their performance declined for systematic reviews and meta-analyses.
Although LLM-generated summaries substantially improved readability, guideline adherence and methodological precision remained inconsistent, underscoring the need for human oversight, particularly for complex evidence syntheses.
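For readers unfamiliar with the two readability metrics, the standard published FRE and FKGL formulas can be sketched as below. This is a minimal illustration, not the study's own scoring pipeline: the abstract does not say which tool computed the scores, and the function names and the word, sentence, and syllable counts passed in are assumptions made for the example.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula; lower scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short passage (not data from the study).
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"FRE = {fre:.3f}, FKGL = {fkgl:.2f}")
```

The two scales run in opposite directions: a more readable text has a higher FRE and a lower FKGL, which matters when interpreting a phrase such as "highest readability values".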
Uranbey et al. studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: