August 14, 2025 · Open Access

Comparative Performance of Large Language Models in Muscle Histology Classification Highlights Enhanced Accuracy of ChatGPT-4o in Tissue Identification

Key Points

ChatGPT-4o achieved the highest F1 score for tissue type identification (0.772), ahead of Gemini 1.5 Flash (0.460) and Claude 3.5 Sonnet (0.380).
For plane-of-sectioning prediction, Claude had the highest F1 score (0.472), followed by ChatGPT (0.396) and Gemini (0.344).
The evaluation used 300 digital histology images from the University of South Florida Morsani College of Medicine Virtual Microscopy database and was scored with precision, recall, accuracy, and F1 score.
These findings indicate that ChatGPT leads in tissue identification, but its accuracy, particularly for plane of sectioning, must improve before it can reliably support clinical or medical education use.

Abstract

Introduction: One of the most promising avenues for integrating artificial intelligence (AI) into medicine is the examination, evaluation, and characterization of pathological slides. The use of large language models (LLMs), an increasingly popular class of AI model, in pathological applications remains unexplored. This study investigates the histological image recognition capabilities of the multimodal models Gemini 1.5 Flash, ChatGPT-4o, and Claude 3.5 Sonnet and assesses their suitability for clinical or medical education use.

Methods: The models were evaluated on 300 digital histology images from the University of South Florida Morsani College of Medicine Virtual Microscopy database. Each model received a prompt asking it to identify the tissue type and the plane of sectioning. The images covered the three muscle tissue subtypes in both longitudinal and transverse planes of sectioning. Standard machine learning metrics (precision, recall, accuracy, and F1 score) were used to evaluate each model's performance.

Results: In the prediction of tissue type, OpenAI's ChatGPT had the highest metrics with an F1 score of 0.772, while Gemini produced an F1 score of 0.460 and Claude 0.380. In the prediction of sectioning, Claude produced the highest F1 score at 0.472, while ChatGPT produced 0.396 and Gemini 0.344.

Conclusion: Overall, the results indicate that ChatGPT is the most effective of the three models at identifying tissue type. However, its lower accuracy in identifying the plane of sectioning, where Claude scored higher, leaves room for improvement across varying tissue samples before it can reliably supplement medical education or clinical use.
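
As an illustration of the metrics named in the Methods, the sketch below computes accuracy and macro-averaged precision, recall, and F1 score from paired true and predicted labels. It is not the authors' evaluation code. The abstract does not state how per-class scores were averaged, so macro-averaging is an assumption, as are the function name `classification_metrics`, the class names, and the toy labels.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro-averaging (an unweighted mean over classes) is an assumption;
    the source does not say which averaging scheme was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy labels for illustration only; these are NOT data from the study.
y_true = ["skeletal", "skeletal", "cardiac", "cardiac", "smooth", "smooth"]
y_pred = ["skeletal", "cardiac", "cardiac", "cardiac", "smooth", "skeletal"]
for name, value in classification_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

The same function could be applied once to the tissue-type labels and once to the plane-of-sectioning labels, giving one set of scores per model and task. Because macro-averaging weights every class equally, it suits a balanced image set; a different averaging scheme would yield different numbers.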

Authors

Parth Shah

Pondicherry University

David J. Boughanem

University of South Florida

John Michael Templeton

University of South Florida

Journals

Cureus
