What question did this study set out to answer?

To assess how well large language models perform in scoring postoperative adhesions compared to novice veterinary surgeons.

April 10, 2026

Comparison of adhesion scoring performance between humans and large language models in experimental rat laparotomy

Key Points

To assess how well large language models perform in scoring postoperative adhesions compared to novice veterinary surgeons.
Compared macroscopic adhesion scoring by three large language models and two novice surgeons against expert consensus.
Used the Nair 0-4 adhesion scale for scoring 80 postoperative laparotomy cases in Wistar rats.
Analyzed data using Kruskal-Wallis test, Spearman coefficients for correlations, and Cohen's κ for agreement.
Scores differed significantly across raters: the three LLMs and Novice 1 assigned lower scores, while Novice 2 assigned higher scores.
Novice 1 had the highest exact-match accuracy at 33.8% while LLMs scored ≤26.3% accuracy.
Moderate inter-observer reliability among human raters was observed (ICC = 0.55).

Abstract

This study compared the macroscopic adhesion scoring performance of large language models (LLMs: ChatGPT-o3, ChatGPT-5, Gemini-2.5 Pro) with that of novice veterinary surgeons, using expert consensus as the reference. Eighty standardized postoperative laparotomy cases in Wistar rats were photographed and scored using the Nair 0-4 adhesion scale. Two novice surgeons and three LLMs independently evaluated each case; the expert reference was defined by a surgeon and a pathologist. Group differences were analyzed using the Kruskal-Wallis test with Dunn-Bonferroni post hoc comparisons, correlations by Bonferroni-adjusted Spearman coefficients, human inter-observer reliability by the intraclass correlation coefficient, ICC(A,1), and agreement with the expert by quadratic-weighted Cohen's κ and exact-match accuracy. Overall differences were significant. ChatGPT-o3, ChatGPT-5, Gemini-2.5 Pro, and Novice 1 assigned lower scores, while Novice 2 assigned higher scores. Correlations with the expert were significant for Novice 1 (ρ = 0.706), Novice 2 (ρ = 0.593), and ChatGPT-o3 (ρ = 0.617), but not for ChatGPT-5 or Gemini-2.5 Pro. Inter-observer reliability among human raters was moderate (ICC = 0.55). Importantly, absolute exact-match accuracies were modest across all evaluators, with the highest accuracy observed for Novice 1 (33.8%) and ≤26.3% for the LLMs. While novices outperformed the models, these findings highlight the intrinsic difficulty of fine-grained Nair 0-4 adhesion scoring on two-dimensional intraoperative images and indicate that current LLMs are better suited as calibrated decision-support tools than as stand-alone raters.
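To make the two agreement measures named in the abstract concrete, the following is a minimal sketch of how exact-match accuracy and quadratic-weighted Cohen's κ can be computed for scores on a 0-4 scale. It is not the study's analysis code: the function names and the eight toy scores are hypothetical and chosen only for illustration, and the κ formula used is the standard one, κ = 1 − Σ w·O / Σ w·E with weights w = (i − j)² / (k − 1)².

```python
from collections import Counter


def exact_match_accuracy(rater, reference):
    """Fraction of cases where the rater's score equals the reference score."""
    if len(rater) != len(reference) or not reference:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(r == e for r, e in zip(rater, reference)) / len(reference)


def quadratic_weighted_kappa(rater, reference, n_categories=5):
    """Cohen's kappa with quadratic disagreement weights.

    Scores are assumed to be integers in 0..n_categories-1
    (five categories for a 0-4 scale).
    """
    if len(rater) != len(reference) or not reference:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(reference)
    observed = Counter(zip(rater, reference))  # joint counts O[i, j]
    rater_totals = Counter(rater)              # row marginals
    reference_totals = Counter(reference)      # column marginals
    max_sq_distance = (n_categories - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = (i - j) ** 2 / max_sq_distance
            weighted_observed += weight * observed[(i, j)]
            # Expected count under independence of the two raters.
            weighted_expected += weight * rater_totals[i] * reference_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical toy scores, NOT data from the study.
toy_reference = [0, 1, 2, 3, 4, 2, 3, 1]
toy_rater = [0, 1, 1, 3, 3, 2, 4, 0]

toy_accuracy = exact_match_accuracy(toy_rater, toy_reference)
toy_kappa = quadratic_weighted_kappa(toy_rater, toy_reference)
print(f"exact-match accuracy: {toy_accuracy:.3f}")
print(f"quadratic-weighted kappa: {toy_kappa:.3f}")
```

In this toy example every miss is off by a single grade, so the exact-match accuracy is low relative to the weighted κ. This illustrates why the two measures can diverge: exact match treats any miss as a full error, whereas quadratic weighting penalizes near misses only lightly.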
