Abstract

Purpose: This study assessed the ability of two large language models (LLMs), ChatGPT-4.0 and Perplexity Pro, to generate accurate and concise summaries of landmark surgical studies for use in surgical resident education.

Methods: Surgical attendings from four subspecialties selected landmark studies within their respective fields. ChatGPT-4.0 and Perplexity Pro were prompted to generate one-page summaries of each study. Blinded surgical attendings evaluated the summaries for accuracy across six domains: background, methods, inclusion/exclusion criteria, treatment groups, key findings, and clinical relevance. Precision, recall, and F1 scores were calculated. Precision measures the ability of an LLM to include only true information, while recall measures its ability to include all relevant information. The F1 score is a composite of precision and recall that represents the LLM's overall performance. Residents were surveyed to assess the educational utility of the LLM-generated summaries.

Results: ChatGPT-4.0 achieved a precision of 0.93, a recall of 0.96, and an F1 score of 0.95. Perplexity Pro performed similarly, with a precision of 0.96, a recall of 0.94, and an F1 score of 0.95. Both models performed worst when reporting inclusion/exclusion criteria: ChatGPT-4.0 achieved a precision of 0.93, a recall of 0.84, and an F1 score of 0.89 in this domain, and Perplexity Pro achieved a precision of 1.00, a recall of 0.81, and an F1 score of 0.90. There were no significant differences in performance between the two models. All residents surveyed reported that the summaries would be helpful as a supplement to their regular study materials and recommended generating more summaries with LLMs.

Conclusions: AI-generated summaries from both ChatGPT and Perplexity were accurate and were perceived as educationally valuable by surgical residents. This approach offers an efficient tool for integrating key surgical literature into resident education.
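The abstract describes the F1 score only as a composite of precision and recall. The sketch below assumes the standard definition, the harmonic mean of the two, which the abstract does not confirm. The function names and the example counts are hypothetical and are not taken from the study.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from counts of summary statements.

    tp: statements in the summary that are true and relevant
    fp: statements in the summary that are false (hypothetical count)
    fn: relevant facts the summary left out (hypothetical count)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 9 correct statements, 1 false, 1 omitted.
p, r = precision_recall(tp=9, fp=1, fn=1)
print(f"precision={p:.2f} recall={r:.2f} F1={f1_score(p, r):.2f}")

# Recompute F1 from the rounded precision and recall reported in the abstract.
reported = {
    "ChatGPT-4.0 overall": (0.93, 0.96),
    "Perplexity Pro overall": (0.96, 0.94),
    "ChatGPT-4.0 inclusion/exclusion": (0.93, 0.84),
    "Perplexity Pro inclusion/exclusion": (1.00, 0.81),
}
for label, (prec, rec) in reported.items():
    print(f"{label}: F1 = {f1_score(prec, rec):.3f}")
```

Because the inputs here are the two-decimal values from the abstract, a recomputed F1 can differ from the reported one in the last digit. For example, 0.93 and 0.96 give roughly 0.945 against the reported 0.95. One plausible explanation is that the reported scores were computed from unrounded values.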
Pettigrew et al. studied this question.