Background: Large language models (LLMs) such as ChatGPT have evolved rapidly, with notable improvements in coherence, factual accuracy, and contextual relevance. However, their academic and clinical applicability remains under scrutiny. This study evaluates the temporal performance evolution of LLMs by comparing earlier model outputs (GPT-3.5 and GPT-4.0) with ChatGPT-4.5 across three domains: aesthetic surgery counseling, an academic discussion base of thumb arthritis, and a systematic literature review. Methods: We replicated the methodologies of three previously published studies using identical prompts in ChatGPT-4.5. Each output was assessed against its predecessor using a nine-domain Likert-based rubric measuring factual accuracy, completeness, reference quality, clarity, clinical insight, scientific reasoning, bias avoidance, utility, and interactivity. Expert reviewers in plastic and reconstructive surgery independently scored and compared model outputs across versions. Results: ChatGPT-4.5 outperformed earlier versions across all domains. Reference quality improved most significantly (a score increase of +4.5), followed by factual accuracy (+2.5), scientific reasoning (+2.5), and utility (+2.5). In aesthetic surgery counseling, GPT-3.5 produced generic responses lacking clinical detail, whereas ChatGPT-4.5 offered tailored, structured, and psychologically sensitive advice. In academic writing, ChatGPT-4.5 eliminated reference hallucination, correctly applied evidence hierarchies, and demonstrated advanced reasoning. In the literature review, recall remained suboptimal, but precision, citation accuracy, and contextual depth improved substantially. Conclusion: ChatGPT-4.5 represents a major step forward in LLM capability, particularly in generating trustworthy academic and clinical content. While not yet suitable as a standalone decision-making tool, its outputs now support research planning and early-stage manuscript preparation. Persistent limitations include information recall and interpretive flexibility. Continued validation is essential to ensure ethical, effective use in scientific workflows.
Building similarity graph...
Analyzing shared references across papers
Loading...
Ishith Seth
Gianluca Marcaccini
Bryan Lim
Informatics
Harvard University
Massachusetts General Hospital
University of Siena
Building similarity graph...
Analyzing shared references across papers
Loading...
Seth et al. (Tue,) studied this question.
www.synapsesocial.com/papers/68af6210ad7bf08b1eae35f8 — DOI: https://doi.org/10.3390/informatics12030086