December 4, 2025 | Open Access

Thyroid Nodule Experts Evaluating ChatGPT’s Assessment of Thyroid Nodules Classified by the Bethesda System for Reporting Thyroid Cytopathology

Key Points

ChatGPT's recommendations for managing thyroid nodules show moderate to good consistency with clinical guidelines.
Of the 24 responses, 19 (79.2%) showed moderate to good consistency with the ATA guidelines; the mean consistency score was 3.38 on a 4-point scale.
The panel of specialists reached consensus (IQR ≤ 1) on 23 of 24 responses (95.8%), reflecting strong inter-rater reliability.
While useful for patient discussions, ChatGPT’s advice may lack reliability, particularly in high-risk malignancy categories.

Abstract

Importance: ChatGPT has emerged as a medical resource through advanced language processing. Patients with thyroid nodules classified under The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) may use it to complement discussions with physicians.

Objective: We aimed to determine whether ChatGPT's recommendations on managing thyroid nodules classified by TBSRTC align with those of experienced thyroid specialists.

Setting/Participants: A multidisciplinary panel of 5 thyroid cancer specialists, including otolaryngologists and endocrinologists, from 3 university-affiliated teaching hospitals in Montreal, Canada, evaluated the responses.

Intervention/Exposure: ChatGPT-3.5 was prompted with 4 questions for each of the 6 Bethesda categories regarding the meaning and management of thyroid nodules, generating 24 responses for evaluation.

Main Outcome/Measures: We assessed ChatGPT’s accuracy against the latest American Thyroid Association (ATA) guidelines using a 4-point Likert scale (90%). Additionally, specialists rated their comfort or reluctance in recommending ChatGPT as a complementary tool for patient discussions.

Results: Of the 24 ChatGPT-generated responses, 19 (79.2%) demonstrated moderate to good consistency with the ATA guidelines. The mean consistency score was 3.38/4 and median was 3.5. Consensus (IQR ≤ 1) was achieved in 23 out of 24 responses (95.8%), reflecting strong inter-rater reliability. Consistency scores were highest in Bethesda I–III and declined progressively in higher-risk categories, with the lowest mean score observed in Bethesda VI. Similarly, an upward trend in clinician reluctance was observed from Bethesda I through VI, indicating greater caution in recommending ChatGPT responses for patients suspicious for or diagnosed with malignancy (Bethesda V–VI).

Conclusion and Relevance: While ChatGPT’s responses generally align with specialist recommendations, they are not fully reliable. ChatGPT lacks the ability to serve as an independent or accurate source of medical advice for thyroid nodule management. It remains a useful complement for patient discussions, especially in low-risk scenarios, but further improvements are necessary to make it a safe, reliable component of patient care in complex cases.
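
The consensus criterion in the Results (IQR ≤ 1 across the panel's ratings for a response) and the two reported percentages can be illustrated with a minimal Python sketch. The per-response ratings below are hypothetical and are not data from the study; the function names are illustrative, and the quartile method (`inclusive`) is an assumption, since the abstract does not state how the IQR was computed.

```python
from statistics import quantiles


def iqr(ratings):
    """Interquartile range of one response's ratings.

    Assumption: quartiles use the 'inclusive' method; the abstract
    does not say which method the authors used.
    """
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return q3 - q1


def has_consensus(ratings, threshold=1.0):
    """Consensus as defined in the abstract: IQR <= 1."""
    return iqr(ratings) <= threshold


# Hypothetical ratings from 5 raters on the 4-point scale (NOT study data).
agree = [4, 4, 3, 4, 3]   # ratings cluster together
split = [1, 4, 2, 4, 1]   # ratings diverge

print(has_consensus(agree), has_consensus(split))

# Percentages reported in the abstract, recomputed from the stated counts.
pct_consistent = round(19 / 24 * 100, 1)  # responses with moderate to good consistency
pct_consensus = round(23 / 24 * 100, 1)   # responses reaching consensus
print(pct_consistent, pct_consensus)
```

With these hypothetical inputs, the clustered ratings satisfy the criterion and the divergent ratings do not; the recomputed percentages match the 79.2% and 95.8% stated in the Results.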
