Background: Large language models (LLMs) have attracted significant interest for automated clinical coding. However, early data show that LLMs are highly error-prone when mapping medical codes. We sought to quantify and benchmark LLM medical code querying errors across several available LLMs.

Methods: We evaluated GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat performance and error patterns when querying medical billing codes. We extracted 12 months of unique International Classification of Diseases, 9th edition, Clinical Modification (ICD-9-CM), International Classification of Diseases, 10th edition, Clinical Modification (ICD-10-CM), and Current Procedural Terminology (CPT) codes from the Mount Sinai Health System electronic health record (EHR). Each LLM was provided with a code description and prompted to generate a billing code. Exact match accuracy and other performance metrics were calculated. Nonexact matches were analyzed using descriptive metrics and standardized measures of text and code similarity, including METEOR score, BERTScore, and cui2vec cosine similarity. We created and applied a CodeSTS manual similarity grading system to 200 randomly selected codes weighted by EHR code frequency. Using CodeSTS scores, we identified correct "equivalent" or "generalized" generated codes.

Results: A total of 7697 ICD-9-CM, 15,950 ICD-10-CM, and 3673 CPT codes were extracted. GPT-4 had the highest exact match rate (ICD-9-CM: 45.9%; ICD-10-CM: 33.9%; CPT: 49.8%). Among incorrectly matched codes, GPT-4 generated the most equivalent codes (ICD-9-CM: 7.0%; ICD-10-CM: 10.9%), and GPT-3.5 generated the most generalized but correct codes (ICD-9-CM: 29.9%; ICD-10-CM: 18.5%). Extracted code frequency, shorter codes, and shorter code descriptions were associated (P<0.05) with higher exact match rates in nearly all analyses.

Conclusions: All tested LLMs performed poorly on medical code querying, often generating codes conveying imprecise or fabricated information. LLMs are not appropriate for use on medical coding tasks without additional research. (Funded by the AGA Research Foundation and National Institutes of Health.)
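As an illustration of the exact match metric named in the Methods, the following is a minimal sketch and not the authors' evaluation code. The function names `normalize_code` and `exact_match_rate` are hypothetical, and the normalization step (trimming whitespace and uppercasing) is an assumption; the abstract does not say how generated codes were normalized before comparison.

```python
from typing import Iterable, Tuple


def normalize_code(code: str) -> str:
    """Assumed normalization: trim surrounding whitespace and uppercase.

    The abstract does not describe any normalization; this is illustrative only.
    """
    return code.strip().upper()


def exact_match_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of (true_code, generated_code) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (true, generated) pair is required")
    hits = sum(normalize_code(true) == normalize_code(gen) for true, gen in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Made-up pairs for demonstration; these are not data from the study.
    example_pairs = [
        ("E11.9", "E11.9"),     # exact match
        ("250.00", "250.0"),    # differs in the final digit: counted as a miss
        ("99213", " 99213 "),   # matches after trimming whitespace
    ]
    print(f"Exact match rate: {exact_match_rate(example_pairs):.3f}")
```

Under this strict definition, a generated code that is clinically close but not identical counts as a miss, which is why the study analyzes nonexact matches separately with similarity measures.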
Soroush et al. (Fri,) studied this question.