July 14, 2024

Contrasting the performance of mainstream Large Language Models in Radiology Board Examinations (Preprint)

Abstract

BACKGROUND Artificial Intelligence advancements have enabled Large Language Models to significantly impact radiology education and diagnostic accuracy.

OBJECTIVE This study evaluates the performance of mainstream Large Language Models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams.

METHODS A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on accuracy in text-based questions categorized by cognitive levels and medical specialties using chi-square tests and ANOVA.

RESULTS GPT-4 achieved the highest accuracy (83.3%), significantly outperforming others. Tongyi Qianwen also performed well (70.7%). Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions.

CONCLUSIONS GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large models' effectiveness in specialized fields like radiology.
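The chi-square comparison named in the methods can be illustrated with a minimal sketch. It is not the authors' analysis: the correct-answer counts are assumptions back-calculated from the reported accuracies (83.3% and 70.7% of 150 questions), the function names are hypothetical, and the test shown is a plain Pearson chi-square on a 2x2 table without continuity correction, which treats the two models' answers as independent samples.

```python
import math

N_QUESTIONS = 150  # number of text-only questions reported in the abstract


def correct_count(accuracy_pct, n_questions=N_QUESTIONS):
    """Back-calculate a correct-answer count from a reported percentage (assumption)."""
    return round(accuracy_pct / 100 * n_questions)


def chi_square_2x2(correct_a, correct_b, n=N_QUESTIONS):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    Rows are the two models; columns are correct and incorrect answers.
    Returns the test statistic and its p-value.
    """
    observed = [
        [correct_a, n - correct_a],
        [correct_b, n - correct_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


gpt4_correct = correct_count(83.3)    # inferred from the abstract, not reported
tongyi_correct = correct_count(70.7)  # inferred from the abstract, not reported
statistic, p_value = chi_square_2x2(gpt4_correct, tongyi_correct)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

Because every model answered the same 150 questions, a paired test such as McNemar's could be the more natural choice, but it requires per-question results that the abstract does not give.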

Cite This Study

Boxiong Wei studied this question.

synapsesocial.com/papers/68e60668b6db643587599e79 https://doi.org/10.2196/preprints.64284
