May 18, 2023 | Open Access

A Framework for Critically Assessing ChatGPT and Other Large Language Artificial Intelligence Model Applications in Health Care

Abstract

Large language models (LLMs) are pretrained artificial intelligence (AI) algorithms that can interpret text and generate human-like text in real time [1: Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Preprint. Posted online December 26, 2022. arXiv 2212.13138. https://doi.org/10.48550/arXiv.2212.13138]. Recent studies on LLMs (eg, PaLM and GPT-3.5) have found near-human performance on medical examinations, and many possible future applications have been discussed, such as drafting discharge summaries, answering consultations, or generating lists of differential diagnoses [1; 2: Liévin V, Egeberg Hother C, Winther O. Can large language models reason about medical questions? Preprint. Posted online July 17, 2022. arXiv 2207.08143. https://doi.org/10.48550/arXiv.2207.08143; 3: Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5:e107-e108; 4: Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: a pilot study. Int J Environ Res Public Health. 2023;20:3378; 5: Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312; 6: Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198; 7: Howard A, Hope W, Gerada A. ChatGPT and antimicrobial advice: the end of the consulting infection doctor? Lancet Infect Dis. 2023;23:405-406]. Moreover, ChatGPT can already automatically generate empathetic responses to patients and seemingly genuine scientific abstracts [8: Else H. Abstracts written by ChatGPT fool scientists. Nature. 2023;613:423; 9: Goode J. A mental health tech company ran an AI experiment on real users. Nothing’s stopping apps from conducting more. NBC News. https://www.nbcnews.com/tech/internet/chatgpt-ai-experiment-mental-health-tech-app-koko-rcna65110. Accessed January 19, 2023].

The number of possible applications is likely to continue to increase, especially with large public investments in health care AI and as LLMs' capabilities continue to improve [10: Health Secretary announces £250 million investment in artificial intelligence. Gov.UK. https://www.gov.uk/government/news/health-secretary-announces-250-million-investment-in-artificial-intelligence. Accessed January 19, 2023; 11: Rosemain M, Rose M. France to spend 1.8 billion on AI to compete with U.S., China. Reuters Technology News. https://www.reuters.com/article/us-france-tech-idUSKBN1H51XP. Accessed January 19, 2023]. However, it can be challenging for clinicians with limited technical understanding to assess the feasibility of such applications [12: Chen M, Zhang B, Cai Z, et al. Acceptance of clinical artificial intelligence among physicians and medical students: a systematic review with cross-sectional survey. Front Med (Lausanne). 2022;9:990604]. This could result in focusing on unrealistic applications or neglecting promising ones.

Frameworks have been developed to assist in assessing AI applications for specific domains, eg, radiology [13: Omoumi P, Ducarouge A, Tournier A, et al. To buy or not to buy-evaluating commercial AI solutions in radiology (the ECLAIR guidelines). Eur Radiol. 2021;31:3786-3796]. However, no such framework exists for LLM applications. Therefore, this article proposes a simple framework that nontechnical health care professionals can use to assess the feasibility of potential LLM applications in health care. The framework consists of the following 4 steps:

1. Determine the main source of health care data that the LLM uses.
2. Determine the intended recipient of the LLM’s output.
3. Combine the answers from (1) and (2) to identify a category.
4. Assess fundamental limitations for that category.

Step 1. Determine whether the health care data that the LLM will use to reply to prompts come from patients (eg, health data or medical records), health care providers (eg, information on procedures, medications, research, or organizational information such as opening hours), or payers (eg, information on reimbursement of procedures).

Step 2. Determine whether the main reader of the LLM's output is a patient, provider, or payer.

Step 3. Combine the answers from steps 1 and 2 in the LLM feasibility framework (Table 1) to identify which category the solution belongs to.

Table 1. LLM Feasibility Framework: Matrix for Determining Category of LLM Application

Columns give the main source of health care data and read "Using … data to highly automate summaries or explanations of …" followed by the example in each cell. Rows give the main recipient of the output.

| Main recipient of output | Using patient data… | Using provider data… | Using payer data… |
| --- | --- | --- | --- |
| Patients (adapting output to, eg, individual patients’ health literacy, medical history, and current medications) | Category 1. Example: Patient’s own medical records (eg, discharge notes, laboratory results, investigations) | Category 2. Example: Provider information (eg, medications, treatments, preoperative processes) | Category 3. Example: Payer information (eg, coverage, explanation of health care system, available providers) |
| Providers (adapting output to, eg, providers’ specific clinical context, resources, or inquiry) | Category 2. Example: Pertinent patient information (eg, from medical records, laboratory results) | Category 2. Example: Relevant medical information (eg, merging local or international guidelines, research) | Category 3. Example: Relevant payer information (eg, reimbursement, quality measures, or coverage) |
| Payers (adapting output to, eg, payers’ specific rules on coverage, reimbursement, or quality measures) | Category 2. Example: Relevant population data (eg, aggregate statistics from free-text medical records) | Category 3. Example: Relevant provider information (eg, quality, efficiency, or cost of providers/pathways) | Category 3. Example: Improving existing internal knowledge management systems |

LLM, large language model.

Step 4. Table 2 describes the categories and their corresponding fundamental limitations, which can be used to assess the feasibility of a specific application. LLMs can be used in many different ways and are developing rapidly. However, some limitations are intrinsic to the AI model itself, as can be seen in the literature to date, and these are unlikely to change despite the rapid development. In brief, these limitations are as follows:

i. Lack of understanding: LLMs lack a human-like understanding of the real-world phenomena that words describe and only process their semantic representation. This lack of understanding is highlighted by unpredictable, illogical errors in reasoning in recent LLM studies [2, 5; 14: Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an adjunct for radiologic decision-making. Preprint. Posted online February 7, 2023. medRxiv 23285399. https://doi.org/10.1101/2023.02.02.23285399]. It limits the extent to which LLMs can act autonomously without oversight and creates the need for control mechanisms to ensure the appropriateness of the output.

ii. Lack of predictability: LLMs run the risk of creating “hallucinations” (text responses that are either nonsensical or unfaithful to the content they should use) and errors that are difficult to predict, which can entail patient risks [1, 4; 15: Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. Preprint. Posted online February 8, 2022. arXiv 2202.03629. https://doi.org/10.1145/3571730]. Manufacturers must ensure that medical LLM software performs in a safe and predictable manner according to relevant legislation (eg, the Medical Device Regulation in Europe). Guaranteeing that an LLM never hallucinates is challenging. This risk can be partially mitigated by, for example, letting a clinician assess the output before it is acted on (to identify errors) or by forcing the LLM to reference external sources for the statements in its output (to allow comparisons with the original data that the output is based on).

iii. Lack of empathy: Even if LLMs can generate seemingly empathetic responses, they cannot experience emotions or empathize with a patient when providing emotional support [16: Montemayor C, Halpern J, Fairweather A. In principle obstacles for empathic AI: why we can’t replace human empathy in healthcare. AI Soc. 2022;37:1353-1359]. Moreover, people may not perceive empathy as genuine when it comes from an algorithm [17: Morris RR, Kouddous K, Kshirsagar R, Schueller SM. Towards an artificially empathic conversational agent for mental health applications: system design and user perceptions. J Med Internet Res. 2018;20:e10148]. This may change over time but is currently a limitation in, for example, using unsupervised LLM output to provide patients with sensitive information.

Table 2. LLM Feasibility Framework: Limitations Relevant for Each Category

| Category | Example of health care data used | Lack of understanding | Lack of predictability | Lack of empathy |
| --- | --- | --- | --- | --- |
| 1: Output without clinical supervision | Patient health data: eg, medical records, blood results, patient-reported outcome measures, data from wearables | ✓ | ✓ | ✓ |
| 2: Supervised output which can impact clinical decisions | Patient health data (as above). Generic provider data: information about, eg, medications, treatments, procedures, research. Specific provider data: information about, eg, clinicians, opening hours, services provided | ✓ | ✓ | |
| 3: Administrative output | Provider information (generic/specific as above). Payer data: administrative data, process measures, reimbursement, costs | ✓ | | |

LLM, large language model.

The framework could be applied as follows: Imagine an LLM application that aims to improve patient adherence by adapting generic medication information (provider data) to patients (patient recipient) and their different levels of health literacy. The combination of data and recipient places the application in category 2, and therefore it would be important to understand how the application addresses the lack of understanding and the lack of predictability of LLMs.

This framework has several limitations. First, it is not exhaustive but is designed as a simple heuristic for an initial understanding of what fundamental limitations an LLM application may have. This framework does not replace a comprehensive assessment, which is needed before clinical implementation. Such an assessment will include several important aspects, such as the interpretability of models (to what extent one can understand why they produce a certain result), which LLM is used, whether the training data are sufficiently representative and of high quality, and whether the model has been fine-tuned on medical data. Second, it can only be used to identify potentially impractical solutions, not to confirm the feasibility of solutions. Last, although the incorporated limitations seem fundamental, they may change as LLMs and social norms develop. Notwithstanding these limitations, this framework aims to aid nontechnical health care professionals in critically assessing emerging LLM applications and ensuring their development into clinically safe and useful tools.

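As a rough illustration of steps 1 through 4, the two tables can be read as a pair of lookups: Table 1 maps a data source and a recipient to a category, and Table 2 maps a category to its fundamental limitations. The Python sketch below encodes that reading. The names `Party`, `CATEGORY`, `LIMITATIONS`, and `assess` are hypothetical and do not come from the article, and the single limitation listed for category 3 is an assumption based on one reading of Table 2. Like the framework itself, it is a screening heuristic and not a substitute for a comprehensive assessment.

```python
from enum import Enum


class Party(Enum):
    """Possible data sources and output recipients (steps 1 and 2)."""
    PATIENT = "patient"
    PROVIDER = "provider"
    PAYER = "payer"


# Table 1 as a lookup: (recipient, data source) -> category.
CATEGORY = {
    (Party.PATIENT, Party.PATIENT): 1,
    (Party.PATIENT, Party.PROVIDER): 2,
    (Party.PATIENT, Party.PAYER): 3,
    (Party.PROVIDER, Party.PATIENT): 2,
    (Party.PROVIDER, Party.PROVIDER): 2,
    (Party.PROVIDER, Party.PAYER): 3,
    (Party.PAYER, Party.PATIENT): 2,
    (Party.PAYER, Party.PROVIDER): 3,
    (Party.PAYER, Party.PAYER): 3,
}

# Table 2 as a lookup: category -> fundamental limitations to examine.
# Assumption: the one check mark for category 3 is "lack of understanding".
LIMITATIONS = {
    1: ("lack of understanding", "lack of predictability", "lack of empathy"),
    2: ("lack of understanding", "lack of predictability"),
    3: ("lack of understanding",),
}


def assess(data_source: Party, recipient: Party) -> tuple[int, tuple[str, ...]]:
    """Return the category (step 3) and its limitations (step 4)."""
    category = CATEGORY[(recipient, data_source)]
    return category, LIMITATIONS[category]


# The article's worked example: generic medication information
# (provider data) adapted for patients (patient recipient).
category, limitations = assess(Party.PROVIDER, Party.PATIENT)
print(category, limitations)
```

For the medication-adherence example this yields category 2, so a reviewer would ask how the application handles the lack of understanding and the lack of predictability.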
LLMs have great potential to improve many parts of health care, but more research is needed to understand their performance, safety, and effect on health care systems. Finally, when addressing novel emerging technologies, keep Amara’s law in mind: “We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.” [18: Amara R. 1925-2017, American futurologist. In: Ratcliffe S. Oxford Essential Quotations. 5th ed. Oxford University Press; 2016]
