What question did this study set out to answer?

The aim is to evaluate AI models' alignment with guidelines for diabetic foot infection management.

February 27, 2026 · Open Access

Comparative assessment of large language models in diabetic foot infection management: alignment with IWGDF/IDSA guidelines

Key Points

The aim is to evaluate AI models' alignment with guidelines for diabetic foot infection management.
Evaluated four AI models using clinical dimensions: Accuracy, Overconclusiveness, Supplementary Value, and Completeness.
Used a 5-point Likert scale for assessments and analyzed readability with Flesch Reading Ease and Flesch–Kincaid Grade Level.
Conducted statistical analyses including ANOVA and post hoc comparisons.
No significant differences in Accuracy and Overconclusiveness across the models.
Grok-3 scored significantly higher than all other models in Supplementary Value, and higher than ChatGPT-4o and Claude-3.7 in Completeness.
DeepSeek-R1 generated more complex text than ChatGPT-4o.
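The two readability scores named above follow the standard published formulas, which depend only on average sentence length and average syllables per word. The sketch below is illustrative: the function names and the vowel-group syllable counter are assumptions made for this example, and the study does not state which tool it used to compute the scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption of this sketch, not the study's method)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" as non-syllabic ("care"), but keep "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard formulas: higher FRE means easier text; higher FKGL means harder text.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl
```

Because the syllable count is approximate, scores from this sketch can differ slightly from those of dedicated readability tools. The direction of the comparison is what matters here: a lower FRE and a higher FKGL both indicate more complex text.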

Abstract

Objective: To assess the clinical utility of four artificial intelligence (AI) models (ChatGPT-4o, DeepSeek-R1, Grok-3, and Claude-3.7) by measuring how well their responses align with international guidelines for diabetic foot infection (DFI) management.

Background: AI systems have shown potential in many fields, but their effect in medicine and health care still requires in-depth study. DFI is a relatively common and serious complication of diabetes, so accurate communication of relevant information is important. It is therefore worth evaluating whether AI can serve as an effective clinical support tool.

Methods: Responses from ChatGPT-4o, DeepSeek-R1, Grok-3, and Claude-3.7 were evaluated against DFI guidelines on four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Completeness), each rated on a 5-point Likert scale. Readability was assessed with Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analyses included ANOVA and post hoc comparisons.

Results: No significant differences were found across models for Accuracy or Overconclusiveness (p > 0.05). Supplementary Value differed significantly (p < 0.001): Grok-3 outperformed ChatGPT-4o (p < 0.0001), DeepSeek-R1 (p = 0.003), and Claude-3.7 (p < 0.0001). Completeness also differed significantly (p = 0.005): Grok-3 outperformed ChatGPT-4o (p = 0.016) and Claude-3.7 (p = 0.010). Readability varied as well: DeepSeek-R1 responses were more complex than ChatGPT-4o responses (p = 0.046).

Conclusion: All models performed comparably in accuracy and in avoiding overconclusive statements. Grok-3 outperformed the other models in Supplementary Value and Completeness. DeepSeek-R1 generated the most complex text. These findings validate the feasibility of AI in the standardized management of DFI, but the models still need to be verified in clinical trials to determine their value in real-world decision-making.
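To show what the ANOVA step tests, the sketch below computes a one-way ANOVA F statistic by hand: the ratio of between-model variance to within-model variance in the ratings. It is a minimal illustration under stated assumptions. The abstract does not say which ANOVA variant or post hoc test was used, the function name is hypothetical, and the ratings are invented placeholders, not data from the study.

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA across groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = mean(x for g in groups for x in g)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical 5-point Likert ratings for one dimension (placeholders, NOT study data).
ratings = {
    "model_a": [3, 4, 3, 4, 3],
    "model_b": [4, 4, 3, 4, 4],
    "model_c": [5, 4, 5, 5, 4],
}
f_stat, df_b, df_w = one_way_anova_f(list(ratings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F suggests that at least one model's mean rating differs from the others. Converting F to a p-value requires the F distribution, and identifying which pairs of models differ requires a post hoc test; both are typically delegated to a statistics package.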

Cite This Study

Wu et al. studied this question.

synapsesocial.com/papers/69a1344fed1d949a99abe156 https://doi.org/10.3389/fendo.2026.1667159