What question did this study set out to answer?

This research evaluates the diagnostic performance of ChatGPT in interpreting histopathological images compared to experienced pathologists.

March 2, 2026 | Open Access

Exploring the Diagnostic Limits of ChatGPT: How Far Can a Large Language Model Go in Histopathological Image Interpretation?

Key Points

ChatGPT-4o mini and 15 experienced pathologists evaluated the same 24 histopathological images.
Standard diagnostic queries were used without clinical context.
Responses were classified as correct, false positive, false negative, low-impact error, or no interpretation.
Statistical analyses were performed using McNemar’s test and Fisher’s exact test.
ChatGPT-4o mini achieved 71.4% accuracy, 60.0% sensitivity, and 77.8% specificity.
Pathologists averaged 89.8% accuracy with 97.7% sensitivity and 87.1% specificity.
The low-impact error rate was 33.3% for ChatGPT-4o mini compared to 6.9% for pathologists.
McNemar’s test showed a statistically significant difference favoring pathologists.
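
The accuracy, sensitivity, and specificity figures in the key points follow the usual definitions from a 2x2 confusion table. The sketch below shows those definitions only. The class name, function name, and counts are hypothetical and are not taken from the study, which does not report its underlying counts here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a 2x2 table of predicted versus true diagnosis."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, sensitivity, and specificity as fractions in [0, 1]."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # share of true positives detected
        "specificity": c.tn / (c.tn + c.fp),  # share of true negatives detected
    }


# Hypothetical counts for illustration only (NOT the study's data).
example = ConfusionCounts(tp=8, fp=1, tn=9, fn=2)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

How responses classified as low-impact errors or as no interpretation enter these formulas is not stated in the source, so the sketch assumes only the four standard categories.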

Abstract

Aims: Artificial intelligence’s integration into pathology has accelerated with the adoption of digital workflows. Large language models like ChatGPT offer unique opportunities but have yet to be systematically evaluated in diagnostic image interpretation.

Methods: In this comparative study, 24 histopathological images representing various tissue types and pathological entities were evaluated by ChatGPT-4o mini and 15 experienced pathologists. The model was prompted with a standard diagnostic query without access to clinical information. Pathologists independently assessed the same images. Responses were categorized as correct, false positive, false negative, low-impact error, or no interpretation. Standard diagnostic metrics were calculated, and group comparisons were conducted using McNemar’s test and Fisher’s exact test. Interobserver agreement among pathologists was analyzed using Fleiss’ kappa.

Results: ChatGPT-4o mini achieved an accuracy of 71.4%, with a sensitivity of 60.0% and a specificity of 77.8%. The average accuracy of pathologists was 89.8%, with 97.7% sensitivity and 87.1% specificity. Low-impact errors were more frequent with ChatGPT-4o mini (33.3%) than with pathologists (6.9%). McNemar’s test revealed a statistically significant difference in favor of pathologists. The interobserver agreement among pathologists was in the lower range.

Conclusion: While ChatGPT-4o mini demonstrated partial diagnostic capabilities, it underperformed compared to experienced pathologists. The absence of a clinical context likely impacted the results. Future artificial intelligence models integrating image analysis and clinical data may enhance performance. Despite these limitations, this study highlights the potential of ChatGPT as a supportive diagnostic tool in pathology.
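
McNemar’s test, named in the abstract, compares two raters on the same cases by looking only at the discordant pairs: cases where one rater was correct and the other was not. A minimal sketch of the exact (binomial) form of the test follows, using only the Python standard library. The function name and the example counts are hypothetical, and the abstract does not say how model and pathologist responses were paired or whether the exact or chi-square form was used.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases rater A got right and rater B got wrong
    c: cases rater A got wrong and rater B got right
    Under the null hypothesis, each discordant case is equally likely
    to fall in b or c, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for illustration only (NOT the study's data).
p_value = mcnemar_exact_p(b=1, c=9)
print(f"p = {p_value:.4f}")
```

Concordant pairs, where both raters are right or both are wrong, do not affect the result, which is why the function takes only the two discordant counts.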

Authors

Aghajan Musali

Jamal Musayev

Journals

TURKISH MEDICAL STUDENT JOURNAL

SHILAP Revista de lepidopterología
