Visual mobility scoring to detect lameness in dairy cattle can be subjective and inconsistent. This study assessed the reliability of visual mobility scores from multiple assessors across scoring methods (live vs. video) and experience levels, to evaluate how these factors influence label quality for machine learning applications. We gathered data from two farms using the AHDB 4-point mobility scale and a simplified, post-hoc dichotomised version, with both live and video assessments. Scores varied substantially within and between assessors, particularly for scores 0 and 1 (consistent with normal and slightly abnormal gaits, respectively). Assessors showed only fair consistency (weighted kappa ≈ 0.33) when they scored the same animal by different methods (live vs. video). Post-hoc simplification of the four-level scores to a dichotomous score improved agreement but reduced granularity. Assessor experience had limited influence on agreement levels (P > 0.05), whereas more frequent video viewing during assessment was associated with lower inter-assessor agreement (probability estimate = -0.49, P = 0.005), suggesting higher uncertainty in ambiguous cases. Qualitative feedback from assessors indicated that the speed of the animal affected their scoring decisions (β = -1.92, P = 0.007). These results highlight the difficulties of using subjective human scores as labels for machine learning training. Improving automatic lameness detection in dairy cattle will require strategies that reduce this variation and labels that are more definitive.
Linardopoulou et al. studied this question.
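To make the agreement statistics concrete, the following is a minimal sketch of how a weighted Cohen's kappa between live and video scores might be computed, before and after dichotomising the 4-point scale. It is not the authors' analysis code. The function names `weighted_kappa` and `dichotomise`, the linear weighting scheme, the cut-off of score 2 or above as "lame", and the example scores are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def weighted_kappa(a, b, n_levels, weights="linear"):
    """Cohen's weighted kappa for two equal-length lists of integer
    scores in the range 0..n_levels-1.

    The weighting scheme ("linear" or "quadratic") is an assumption;
    the abstract does not state which one was used.
    """
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equal in length")
    if n_levels < 2:
        raise ValueError("need at least two score levels")
    n = len(a)
    power = 1 if weights == "linear" else 2

    def disagreement(i, j):
        # 0 for identical scores, 1 for the largest possible gap.
        return (abs(i - j) / (n_levels - 1)) ** power

    pair_counts = Counter(zip(a, b))
    marg_a, marg_b = Counter(a), Counter(b)

    # Weighted disagreement actually observed.
    observed = sum(disagreement(i, j) * c / n for (i, j), c in pair_counts.items())
    # Weighted disagreement expected by chance from the marginals.
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j] / n**2
        for i in range(n_levels)
        for j in range(n_levels)
    )
    if expected == 0:
        # Both assessors used one identical category: kappa is undefined.
        return float("nan")
    return 1 - observed / expected


def dichotomise(scores, threshold=2):
    """Collapse 4-point scores to 0 (not lame) or 1 (lame).

    The threshold of 2 is an assumed cut-off, not taken from the abstract.
    """
    return [int(s >= threshold) for s in scores]


# Made-up scores for eight cows, for illustration only (not study data).
live = [0, 1, 1, 2, 0, 3, 1, 2]
video = [1, 0, 1, 2, 0, 2, 2, 2]

kappa_4pt = weighted_kappa(live, video, n_levels=4)
kappa_binary = weighted_kappa(dichotomise(live), dichotomise(video), n_levels=2)
print(f"4-point weighted kappa: {kappa_4pt:.2f}")
print(f"dichotomised kappa:     {kappa_binary:.2f}")
```

With only two levels the weighted and unweighted kappa coincide, so the dichotomised comparison gives up the ability to distinguish near-misses (0 vs. 1) from large disagreements (0 vs. 3), which is the loss of granularity noted above.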