Abstract

Objectives: This study evaluated the quality and trustworthiness of large language model (LLM)-generated scientific summaries and plain language summaries (PLS) of clinical oncology literature, focusing on faithfulness (absence of hallucinations), relevance, and readability.

Materials and Methods: Ten LLM-generated scientific summaries and PLS from the INSIDE (artificial INtelligence to Support Informed DEcision making) prostate cancer dataset were evaluated. For comparison, expert-written PLS from the BioLaySumm dataset were used. A panel of 5 LLMs and 3 human experts verified faithfulness. Verification was performed on original facts and on facts modified with varying levels of error (subtle, moderate, contradictory). Readability was assessed using Flesch-Kincaid Reading Ease (FRE) scores.

Results: Fact verification against the summaries was ∼100%, confirming accurate fact extraction. Agreement between the LLM panel and the human panel was substantial (kappa 0.67), exceeding agreement within the human panel (0.43 [95% CI, 0.34–0.52]) and within the LLM panel (0.40 [0.38–0.42]). LLM-generated scientific summaries showed high faithfulness (88.9% [88.0–89.8]) and low hallucination rates (9.6% [6.5–12.7]) compared with human-written PLS (61.6% [60.1–63.1] faithfulness; 40.6% [37.8–43.4] hallucinations). The LLMs were sensitive to errors: scores decreased as fact modifications became more severe. Finally, LLM-generated PLS were more readable than human-written versions (FRE 42.3 [interquartile range, IQR 35.27–49.41] vs 28.8 [IQR 21.02–36.18]).

Discussion: A panel of LLMs reliably assessed the faithfulness of scientific summaries to their original source and can thus help increase reliability for clinical use. The lower faithfulness of human-written PLS likely reflects extrinsic hallucinations added for context.

Conclusion: The study demonstrates a novel approach to automatically assessing the quality and trustworthiness of LLM-generated scientific summaries and PLS via faithfulness, relevance, and readability.
Stenzl et al. studied this question.
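For readers unfamiliar with the readability measure reported in the abstract, the following is a minimal sketch of the standard Flesch Reading Ease formula computed from raw counts. The abstract does not say which tool or syllable-counting method the authors used, so the function name `flesch_reading_ease` and the example counts are illustrative assumptions, not the study's implementation.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease score from raw text counts.

    Higher scores indicate easier text. How words, sentences, and
    syllables are counted is left to the caller (an assumption here;
    the abstract does not describe the counting method).
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences      # words per sentence
    avg_syllables_per_word = syllables / words   # syllables per word
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Hypothetical counts, for illustration only (not data from the study).
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 3))
```

Under this formula, longer sentences and more syllables per word both lower the score, which is consistent with the abstract's reading of a higher FRE as more readable text.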