The MedRAX framework and the BiomedCLIP vision-language model showed the highest accuracy. No statistically significant difference was observed between proprietary and open-source models, which may indicate potential for improving accuracy through refinement of open-source LLM-based models. Overall, the accuracy of the evaluated models was insufficient for implementation in current clinical practice. These results should be regarded as exploratory given the small dataset size, the single-centre design, the different prompting strategies used for foundation and domain-adapted models, and the use of PNG images instead of DICOM.
Khovanova et al. studied this question.