Abstract

The integration of artificial intelligence language models into medical literature requires rigorous evaluation of accuracy and reliability, especially in specialized domains. This study assessed ChatGPT-5’s capacity to generate clinically accurate scientific content on penile prosthesis implantation. Using structured prompts, ChatGPT-5 produced a narrative review that was evaluated across four domains: (1) verification of factual statements, (2) reference validity via PubMed and Google Scholar, (3) plagiarism screening with iThenticate and Quetext, and (4) qualitative assessment using the Scale for the Assessment of Narrative Review Articles and a peer-review rubric. ChatGPT-5 demonstrated high factual accuracy overall, with most statements supported by the literature, although errors were identified in historical timelines and survival data. In contrast, reference analysis revealed significant weaknesses: only about one-third of citations were fully accurate, and several contained fabricated or incomplete bibliographic details. Text similarity rates were low. Overall quality was rated as good according to the standardized assessment tools, with strong agreement between reviewers. Collectively, these findings indicate that ChatGPT-5 can produce clinically accurate, well-structured content but has important weaknesses in reference reliability and evidence synthesis. The results support a hybrid model in which artificial intelligence serves as a drafting aid under expert supervision rather than as a standalone author. Future work should prioritize strengthening citation validity to enhance reliability while safeguarding scientific integrity.
Ok et al. studied this question.