What question did this study set out to answer?

June 19, 2026 | Open Access

Enhancing the quality and trustworthiness of large language model-generated summaries of clinical oncology literature

Key Points

This study aims to assess the quality and trustworthiness of large language model-generated summaries of clinical oncology literature.
Ten LLM-generated scientific summaries and plain language summaries (PLS) from the INSIDE prostate cancer dataset were evaluated.
These were compared with expert-written PLS from the BioLaySumm dataset.
Faithfulness was verified by a panel of LLMs and human experts, and readability was assessed using Flesch-Kincaid Reading Ease (FRE) scores.
Fact verification against summaries was approximately 100%, confirming accurate fact extraction.
LLM-generated summaries showed high faithfulness (88.9%) and low hallucinations (9.6%) compared to human-written PLS (61.6% faithfulness, 40.6% hallucinations).
LLM-generated PLS had higher readability (FRE 42.3) than human-written versions (FRE 28.8).
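
To make the FRE figures above easier to interpret, the following is a minimal sketch of the standard Flesch Reading Ease formula, where higher scores mean easier text. The function names and the vowel-group syllable counter are illustrative assumptions; the study does not state which tool computed its scores, and dedicated readability libraries count syllables more carefully.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption, not the study's method):
    one syllable per run of consecutive vowels, with a minimum of 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("text must contain at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def fre_from_text(text: str) -> float:
    """Estimate FRE from raw text using simple regex tokenization."""
    # Sentences: split on runs of terminal punctuation, ignore empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: runs of ASCII letters only (a simplification).
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)
```

Because the syllable heuristic is rough, scores from this sketch will not exactly match those of established readability tools, so it should only be used to compare texts scored the same way.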

Abstract

Objectives: This study evaluated the quality and trustworthiness of large language model (LLM)-generated scientific summaries and plain language summaries (PLS) of clinical oncology literature, focusing on faithfulness (absence of hallucinations), relevance, and readability.

Materials and Methods: Ten LLM-generated scientific summaries and PLS from the INSIDE (artificial INtelligence to Support Informed DEcision making) prostate cancer dataset were evaluated. For comparison, expert-written PLS from the BioLaySumm dataset were used. A panel of 5 LLMs and 3 human experts verified faithfulness. Verification was performed on original facts and on facts modified with varying levels of error (subtle, moderate, contradictory). Readability was assessed using Flesch-Kincaid Reading Ease (FRE) scores.

Results: Fact verification against the summaries was ∼100%, confirming accurate fact extraction. Agreement between the LLM panel and the human panel was substantial (kappa 0.67), exceeding agreement within the human panel (0.43; 95% CI, 0.34–0.52) and within the LLM panel (0.40; 0.38–0.42). LLM-generated scientific summaries showed high faithfulness (88.9%; 88.0–89.8) and low hallucinations (9.6%; 6.5–12.7) compared with human-written PLS (faithfulness 61.6%; 60.1–63.1; hallucinations 40.6%; 37.8–43.4). The LLMs were sensitive to errors, with scores decreasing as the fact modifications became more severe. Finally, LLM-generated PLS were more readable than human-written versions (FRE 42.3; interquartile range [IQR], 35.27–49.41 vs 28.8; IQR, 21.02–36.18).

Discussion: A panel of LLMs reliably assessed the faithfulness of scientific summaries to their original source and can therefore help increase their reliability for clinical use. The lower faithfulness of human-written PLS likely reflects extrinsic hallucinations added for context.

Conclusion: The study demonstrates a novel approach to automatically assessing the quality and trustworthiness of LLM-generated scientific summaries and PLS via faithfulness, relevance, and readability.
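
As an illustration of the agreement statistic reported in the Results, the sketch below computes Cohen's kappa for two raters who labeled the same facts. This is an assumption for illustration only: the abstract does not say which kappa variant was used, how the verdicts of each multi-member panel were pooled, or what label set was applied. The label strings and the function name here are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own marginal rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two raters on four extracted facts.
rater_1 = ["supported", "supported", "unsupported", "unsupported"]
rater_2 = ["supported", "unsupported", "unsupported", "unsupported"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy example the raters agree on 3 of 4 facts (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5. Comparing more than two raters at once would typically call for a multi-rater statistic such as Fleiss' kappa instead.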
