What does this research mean for the field?

ChatGPT-5 can generate clinically accurate and well-structured medical literature but exhibits significant weaknesses in reference reliability and evidence synthesis, necessitating its use as an expert-supervised drafting aid rather than a standalone author. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

This study aims to evaluate the accuracy and reliability of content generated by ChatGPT-5 on penile prosthesis implantation.

May 29, 2026 · Open Access

Evaluation of ChatGPT-5–generated surgical literature: the accuracy of a review on penile prosthesis implantation

Key Points

This study aims to evaluate the accuracy and reliability of content generated by ChatGPT-5 on penile prosthesis implantation.
Structured prompts were used to generate a narrative review on penile prosthesis implantation.
Accuracy was assessed through factual verification, reference validity analysis, plagiarism checks, and qualitative assessment.
Review quality was evaluated using standardized tools and peer-review rubrics.
ChatGPT-5 achieved high factual accuracy but contained errors in historical timelines and survival data.
Only about one-third of references were fully accurate, with some citations being fabricated or incomplete.
Overall quality received a good rating, with reviewers showing strong agreement.

Abstract

The integration of artificial intelligence language models into medical literature requires rigorous evaluation of accuracy and reliability, especially in specialized domains. This study assessed ChatGPT-5’s capacity to generate clinically accurate scientific content on penile prosthesis implantation. Using structured prompts, ChatGPT-5 produced a narrative review evaluated across four domains: (1) verification of factual statements, (2) reference validity via PubMed and Google Scholar, (3) plagiarism screening with iThenticate and Quetext, and (4) qualitative assessment using the Scale for the Assessment of Narrative Review Articles and a peer-review rubric. ChatGPT-5 demonstrated high factual accuracy overall, correctly supporting most statements, although errors were identified in historical timelines and survival data. In contrast, reference analysis revealed significant weaknesses, with only about one-third of citations being fully accurate and several containing fabricated or incomplete bibliographic details. Text similarity rates were low. Overall quality was rated as good according to standardized assessment tools, with strong agreement between reviewers. Collectively, these findings indicate that ChatGPT-5 can produce clinically accurate, well-structured content but demonstrates important weaknesses in reference reliability and evidence synthesis. The results support a hybrid model in which artificial intelligence serves as a drafting aid under expert supervision rather than as a standalone author. Future work should prioritize strengthening citation validity to enhance reliability while safeguarding scientific integrity.
