What question did this study set out to answer?

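The study set out to evaluate how reliable visual lameness (mobility) scores are across assessors, scoring methods (live vs. video), and experience levels, and what that variability means for using the scores as training labels in automated detection systems.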

March 12, 2026 | Open Access

How reliable is visual lameness scoring? Assessing human label variability for use in automated detection systems

Key Points

The study aims to evaluate the reliability of visual lameness scoring across different assessors and methods to enhance automated detection systems.
Assessed visual mobility scores from multiple assessors using live and video scoring methods.
Used the AHDB 4-point mobility scale and a simplified dichotomous scoring version.
Analyzed within- and between-assessor variations in scoring consistency.
Collected qualitative feedback from assessors regarding influencing factors.
Scores varied substantially within and between assessors, especially for scores 0 and 1 (normal and slightly abnormal gaits).
Weighted kappa indicated only fair consistency (≈ 0.33) when assessors scored the same animal live versus on video.
Simplifying the four-level scores to a dichotomous score improved agreement but reduced granularity.
More frequent video viewing during assessment was associated with lower inter-assessor agreement (P = 0.005), suggesting higher uncertainty in ambiguous cases.
Assessor comments indicated that the speed of the animal significantly affected scoring decisions (β = -1.92, P = 0.007).

Abstract

Visual mobility scoring to detect lame dairy cattle can be subjective and inconsistent. This study assessed the reliability of visual mobility scores from multiple assessors, using different scoring methods (live vs. video) and experience levels to evaluate their influence on label quality for machine learning applications. We gathered data from two farms using the AHDB 4-point mobility scale and a simplified post-hoc dichotomised version, with both live and video assessments. Substantial within- and between-assessor variation was seen in scores, particularly for scores 0 and 1 (consistent with normal and slightly abnormal gaits, respectively). Assessors showed only fair (weighted kappa ≈ 0.33) score consistency when they scored the same animal in different ways (live vs. video). Post-hoc simplification of the four-level scores to a dichotomous score improved agreement but reduced granularity. Assessor experience had limited influence on agreement levels (P > 0.05), and increased video viewing frequency during the assessment process was associated with lower inter-assessor agreement (probability estimate = -0.49, P = 0.005), suggesting higher uncertainty in ambiguous cases. Qualitative feedback from assessor comments revealed that the speed of the animal affected their scoring decisions (β = -1.92, P = 0.007). These results highlight the difficulties in using subjective human scores as labels for machine learning training. To improve automatic lameness detection in dairy cattle, we need strategies to reduce this variation and use more definitive labels.
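To make the agreement statistic concrete, the following is a minimal sketch of how a weighted kappa between live and video scores could be computed, and how dichotomising the scores changes it. It is not the authors' analysis code. Several things are assumptions: the abstract does not say whether linear or quadratic weights were used (linear is the default here), the dichotomisation threshold (scores 0 and 1 versus scores 2 and 3) is a common convention for the AHDB scale but is not stated in the abstract, and the example scores are invented purely for illustration.

```python
from collections import Counter


def weighted_kappa(scores_a, scores_b, n_levels, power=1):
    """Cohen's weighted kappa for two lists of integer scores in 0..n_levels-1.

    power=1 gives linear disagreement weights, power=2 quadratic.
    The weighting scheme is an assumption; the abstract does not specify it.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    if n_levels < 2:
        raise ValueError("n_levels must be at least 2")

    n = len(scores_a)
    observed = Counter(zip(scores_a, scores_b))  # joint counts of score pairs
    marginal_a = Counter(scores_a)               # how often each score was given in A
    marginal_b = Counter(scores_b)               # how often each score was given in B
    max_distance = (n_levels - 1) ** power

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = abs(i - j) ** power / max_distance
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * marginal_a[i] * marginal_b[j] / (n * n)

    if expected_disagreement == 0:
        # Both lists contain a single identical score: kappa is undefined.
        return float("nan")
    return 1.0 - observed_disagreement / expected_disagreement


def dichotomise(scores, threshold=2):
    """Collapse 4-point scores to 0 (not lame) / 1 (lame).

    threshold=2 is an assumed cut-off, not taken from the abstract.
    """
    return [int(score >= threshold) for score in scores]


# Invented example: one assessor scoring the same ten cows live and on video.
live_scores = [0, 1, 1, 2, 0, 3, 1, 2, 0, 1]
video_scores = [1, 0, 1, 2, 1, 3, 2, 2, 0, 0]

kappa_four_level = weighted_kappa(live_scores, video_scores, n_levels=4)
kappa_binary = weighted_kappa(
    dichotomise(live_scores), dichotomise(video_scores), n_levels=2
)
print("4-point weighted kappa:", round(kappa_four_level, 3))
print("dichotomised kappa:", round(kappa_binary, 3))
```

With two levels the weighted statistic reduces to ordinary Cohen's kappa, so the same function covers both the four-level and the dichotomised comparison. Most disagreements in the invented data sit between scores 0 and 1, which the assumed threshold merges into one class; this mirrors the trade-off described in the abstract, where collapsing categories raises agreement at the cost of granularity.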
