See editorial on page 336.

ChatGPT (OpenAI) is a 175 billion–parameter large language model (LLM) artificial intelligence (AI) that was released in November 2022. ChatGPT was developed on the generative pretrained transformer (GPT) 3.5 natural language processing technology and provides a conversational text response to a given prompt [1: OpenAI. https://openai.com/blog/chatgpt. Accessed February 8, 2023]. One potential application of ChatGPT is answering patients’ medical questions. With more than 70 million procedures annually in the United States [2: Ladabaum U, et al. Gastroenterology. 2019;157:137-148], screening colonoscopies are frequently the subject of questions in gastroenterology. In this study, we examine the quality of ChatGPT-generated answers to common questions (CQs) about colonoscopy.

We retrieved 8 CQs and answers about colonoscopy from the publicly available webpages of 3 randomly selected hospitals from the top-20 list of the US News [...] nonsignificant) (Table 1). The raters demonstrated only 48% accuracy in identifying AI-generated answers, with 41% sensitivity and 54% specificity. Three raters had an accuracy of less than 50%, and 1 (a fellow) had 81% accuracy (Supplementary Figure 1 and Supplementary Table 2).

This study is the first of its kind, to our knowledge, to demonstrate that a contemporary LLM-derived conversational AI program can provide easy-to-understand, scientifically adequate, and generally satisfactory answers to CQs about colonoscopy, as determined by gastroenterologists. One surprising finding was the low sensitivity in identifying AI-generated answers (sensitivity of 6%, 25%, and 44%, respectively).
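The accuracy, sensitivity, and specificity figures reported for the raters can be read as summaries of a 2×2 table of true source (AI vs non-AI) against the rater's guess. The following is a minimal sketch of that calculation, not the study's own code; the function name `detection_metrics` and the example labels are hypothetical and do not reproduce the study data.

```python
from typing import Dict, Sequence


def detection_metrics(is_ai: Sequence[bool], guessed_ai: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for spotting AI-generated answers.

    A "positive" is an answer that truly came from the AI.
    """
    if len(is_ai) != len(guessed_ai):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(is_ai, guessed_ai))
    tp = sum(1 for truth, guess in pairs if truth and guess)          # AI, called AI
    fn = sum(1 for truth, guess in pairs if truth and not guess)      # AI, called non-AI
    tn = sum(1 for truth, guess in pairs if not truth and not guess)  # non-AI, called non-AI
    fp = sum(1 for truth, guess in pairs if not truth and guess)      # non-AI, called AI

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one AI and one non-AI answer")

    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels for one rater: 4 AI answers followed by 4 non-AI answers.
truth = [True, True, True, True, False, False, False, False]
guess = [True, False, False, False, True, False, False, True]
metrics = detection_metrics(truth, guess)
print(metrics)
```

Low sensitivity in this framing means that most truly AI-generated answers were labeled as non-AI by the rater.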
Heuristic feedback from the outperforming fellow revealed that “ChatGPT answers tended to be lengthy, used many colons (‘:’) in the long list of possibilities it gave, and tended to be more of a list rather than a narrative paragraph in response.” In contrast, answers from hospital webpages were “more like verbal responses to a patient as opposed to something more encyclopedic.”

This study suggests a potential role of conversational AI programs in optimizing the communication between patients and health care providers, especially for high-volume procedures like colonoscopy. Despite similar ratings, there was little overlap or plagiarism between the AI and non-AI answers as well as between the 2 AI answers (Supplementary Table 1 and Supplementary Table 3), which suggests an inherent plagiarism-avoiding design in LLMs and their capability to create unique answers to the same question.

Accumulated publications about ChatGPT in PubMed grew more than 10-fold from 20 on February 3, 2023, to 246 on April 14, 2023 (Supplementary Figure 2), with topics including board examinations [3: Gilson A, et al. JMIR Med Educ. 2023;9:e45312], authorship and editorial policies [4: Stokel-Walker C, et al. Nature. 2023;614:214-216], medical education [5: Mbakwe AB, et al. PLOS Digit Health. 2023;2:e0000205], clinical decision support [6: Baumgartner C. Clin Transl Med. 2023;13:e1206], and an LLM assessment framework [7: Howard A, et al. Lancet Infect Dis. 2023;23:405-406]. Although early in the adoption curve [8: Rogers EM. Diffusion of innovations. 5th ed. Simon and Schuster, New York; 2003], LLMs (ChatGPT, BioGPT, BARD, and others) may represent a transformative innovation in how medical information (MI) is created by physicians and consumed by patients.
Especially in the current era of shared decision making and the consumerization of health care, patients have been actively consuming MI through multiple channels and accessing providers through electronic patient portals at an exponentially growing rate, which has the potential to benefit patients but, simultaneously, represents a heavy burden for providers and staff. We envision that AI-generated MI, with appropriate provider oversight, accreditation, and periodic surveillance, could improve the efficiency of care and free providers for more cognitively intensive patient communications.

Nevertheless, potential pitfalls have to be addressed. Currently, ChatGPT-generated MI is not constructed on the basis of clinical evidence but is created through an LLM trained on diverse Internet texts with reinforcement learning from human feedback [1: OpenAI. https://openai.com/blog/chatgpt. Accessed February 8, 2023]. LLM outputs may be sensitive and vulnerable to prompt engineering, that is, manipulation by subtle changes in the input prompts, and the consistency of performance might be in “a state of constant change” [9: Lee P, et al. N Engl J Med. 2023;388:1233-1239]. Thus, there remains a large gap, technology- and format-wise, regarding the use of LLMs in responsible clinical care [10: Sackett DL, et al. BMJ. 1996;312:71-72]. Implicit bias is another concern, because the clinical utility might differ for patients with or without resources. Furthermore, readability analyses using validated reading-level metrics (Flesch-Kincaid Grade Level, Gunning Fog Index) revealed that the AI-generated answers were written at significantly higher grade reading levels than the hospital webpages (P < .001), far exceeding the recommended eighth grade threshold (Supplementary Table 4).

This study has several limitations.
First, we did not include patient raters, the group to which colonoscopy preparation answers will ultimately be provided. For this study, we aimed to initially critique AI-generated MI through the lens of medical professionals. Future research should explore responses to a broader sample of questions and clinical conditions, as well as the inclusion of patient raters. Second, the numbers of both hospital webpages and raters were small, which limits generalizability. Finally, webpages of randomly selected top-tier hospitals may not be comprehensive.

This study shows that a conversational AI program can generate credible MI in response to common patient questions. With dedicated domain training, there is meaningful potential to optimize clinical communication to patients.

Tsung-Chun Lee, MD (Conceptualization: Equal; Data curation: Equal; Formal analysis: Equal; Investigation: Equal; Methodology: Equal; Project administration: Equal; Writing – original draft: Equal; Writing – review [...] Formal analysis: Equal; Methodology: Equal; Writing – review [...] Writing – review [...] Writing – review [...] Writing – review [...] Data curation: Lead; Formal analysis: Lead; Investigation: Lead; Methodology: Lead; Project administration: Lead; Resources: Lead; Supervision: Lead; Writing – original draft: Lead; Writing – review [...]

[...] We measured the reading levels of all answers to CQs by 2 objective indexes of the reading level of texts: the Flesch-Kincaid Grade Level [4: Kincaid JP, et al. http://stars.library.ucf.edu/istlibrary/56. Accessed April 12, 2023] and the Gunning Fog Index [5: Avra TD, et al. J Vasc Surg. 2022;76:1728-1732] (Supplementary Table 4). Both indexes are well-recognized objective measures, in which index number x represents the corresponding xth grade reading level [3: Murphy B, et al. Surgeon. 2022;20:E366-E370]. Medical information given to patients ideally should have an index of 8 or less. Measurements were performed with an online readability tool (https://readable.com, accessed on April 12, 2023).

We searched the PubMed database with the keyword “ChatGPT” and obtained the list of publications that involved ChatGPT [6: National Library of Medicine. https://pubmed.ncbi.nlm.nih.gov. Accessed April 12, 2023]. The chronology of ChatGPT publications in PubMed is shown in Supplementary Figure 2.

Data are shown as mean or mean (standard deviation). Comparison of quality indicators on answers from AI vs non-AI sources was performed using the Mann-Whitney U test. To adjust for multiple comparisons, the Bonferroni-corrected α value was calculated as 0.05 divided by 56 comparisons, that is, 0.00089. Therefore, P < .00089 was regarded as significant in the comparison of the 3 quality indicators among answers from AI and non-AI sources.
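The two statistical steps described here, a Bonferroni-corrected threshold and a Mann-Whitney U comparison of two groups of ratings, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' analysis: the function names are hypothetical, the ratings are made up, and the P value uses the large-sample normal approximation with tie correction (no continuity correction), which may differ slightly from the output of a statistics package.

```python
import math
from typing import Sequence, Tuple


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided P value by normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign midranks so that tied values share the average of their ranks.
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value is identical, so there is no evidence of a difference
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Threshold from the text: 0.05 divided by 56 comparisons.
alpha_corrected = bonferroni_alpha(0.05, 56)
print(f"Bonferroni-corrected alpha: {alpha_corrected:.5f}")

# Hypothetical ratings (NOT the study data) for AI and non-AI answers.
ai_ratings = [5, 6, 6, 7, 5, 6]
non_ai_ratings = [5, 5, 6, 7, 6, 6]
u_stat, p_value = mann_whitney_u(ai_ratings, non_ai_ratings)
print(f"U = {u_stat}, P = {p_value:.3f}, significant: {p_value < alpha_corrected}")
```

A rank-based test is a natural fit for ordinal ratings because it compares the ordering of the values rather than assuming they are normally distributed.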
Each rater’s performance in detecting AI-generated answers was collectively calculated and expressed as the sensitivity, [...] positive [...] negative [...] and [...]. In reading level [...], we compared the reading levels of the answers from the [...] sources with an eighth grade reading level [...] and also compared the reading levels of answers from AI and non-AI sources using the Mann-Whitney U test. Statistical analyses were performed using [...].

Supplementary Figure 2. [...] of publications [...] to ChatGPT in the PubMed database on April 14, 2023.

Supplementary Table 1. [...] of Answers to 8 [...]
Columns: [...] vs [...]; [...] vs [...]; [...] vs [...]; [...] vs [...]
Rows: [...] is a [...]; [...] is a colonoscopy [...]; [...] to prepare for a [...]; [...] to expect during the colonoscopy; [...] to expect after the colonoscopy; [...] to do after a negative colonoscopy; [...] to do after a positive colonoscopy; [...] to expect about [...]
Answers [...] as AI1 and AI2 were obtained with the 8 CQs as prompts to ChatGPT on the same [...]. Answers [...] were retrieved from webpages of 3 top-tier hospitals in the United States. Text similarity of answers was compared with [...] (accessed February 15, 2023) and is [...] as [...]. [...] the AI answers shared [...] except [...] the AI answers had extremely low text similarity (0%–16%) to [...] from the hospital [...] not [...].

Supplementary Table 2. [...] of 4 [...] in [...] Answers [...] by [...]
[...] answers generated by [...]; [...] = gastroenterologists; [...] = [...]; [...] = [...].

Supplementary Table 3. Answers From AI and Non-AI [...]
[...]

Supplementary Table 4. [...] of [...] of Answers From AI and Non-AI [...]
Columns: Flesch-Kincaid Grade Level; Gunning Fog Index
Data are shown as mean (standard deviation). Flesch-Kincaid Grade Level and Gunning Fog Index are well-recognized objective measures of the reading levels of texts, in which the number x represents the corresponding xth grade reading level [3: Murphy B, et al. Surgeon. 2022;20:E366-E370] [4: Kincaid JP, et al. http://stars.library.ucf.edu/istlibrary/56. Accessed April 12, 2023] [5: Avra TD, et al. J Vasc Surg. 2022;76:1728-1732]. Measurements were performed with an online readability tool (https://readable.com, accessed on April 12, 2023). Statistical analyses between the reading levels of the answers and the eighth grade reading level were [...] with [...]. Bonferroni-corrected α for multiple [...] = [...] = [...]; P < [...] as [...]. Statistical analyses between the reading levels of answers from AI vs non-AI sources were [...] with the Mann-Whitney U test.