ABSTRACT Automated plant trait recognition from herbarium images is essential for plant sciences, yet it remains challenging because background elements (e.g., textual labels, mounting artefacts and colour charts) can induce shortcut learning, leading models to rely on spurious non‐plant cues rather than plant morphology. This bias degrades both generalisation and interpretability. In this paper, we introduce AT‐ViT, a dual‐branch vision transformer that jointly encodes raw herbarium scans and their segmentation‐derived counterparts via a multi‐scale, multi‐view cross‐attention fusion scheme. AT‐ViT further incorporates a mask‐guided patch weighting mechanism that amplifies plant‐relevant regions and attenuates background‐driven features. Because the model learns from the original scans while the segmentation masks guide this reweighting, it is encouraged to focus on plant organs and to learn plant‐centric representations more effectively. Across multiple trait classification tasks (e.g., leaf base shape, thorns), AT‐ViT delivers consistent accuracy gains, improves attention localisation on plant regions and exhibits increased robustness under synthetic background perturbations. Specifically, relative to CrossViT, AT‐ViT substantially improves spatial attention grounding, boosting plant‐region alignment (Avg IoUₚ: +15.66 to +18.03 pp) while reducing background overlap (Avg IoUb: −27.92 to −31.02 pp). It also remains markedly more robust to background perturbations, outperforming ResNet101 by up to +32.32 accuracy points and CrossViT by up to +5.07 points under background‐noise conditions.
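The abstract does not give the formula behind the mask‐guided patch weighting or the attention‐overlap metrics, so the following is only a minimal sketch of how such quantities could be computed, not the AT‐ViT implementation. The function names, the linear mapping from plant coverage to a weight, the `low`/`high` bounds, the attention threshold and the patch‐level IoU are all illustrative assumptions.

```python
# Illustrative sketch only: every name and formula here is an assumption,
# not taken from the AT-ViT paper.

def patch_coverage(mask, patch):
    """Fraction of plant pixels (value 1) in each non-overlapping patch, row-major."""
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask dimensions must be multiples of the patch size")
    coverage = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            plant = sum(
                mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            coverage.append(plant / (patch * patch))
    return coverage


def patch_weights(coverage, low=0.5, high=1.5):
    """Assumed linear reweighting: background patches -> low, full-plant patches -> high."""
    return [low + (high - low) * c for c in coverage]


def iou(selected, region):
    """Intersection over union of two boolean patch selections."""
    inter = sum(1 for s, r in zip(selected, region) if s and r)
    union = sum(1 for s, r in zip(selected, region) if s or r)
    return inter / union if union else 0.0


# Toy 4x4 segmentation mask (1 = plant, 0 = background), split into 2x2 patches.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
coverage = patch_coverage(mask, patch=2)
weights = patch_weights(coverage)

# Hypothetical per-patch attention scores, thresholded to "attended" patches.
attention = [0.6, 0.3, 0.05, 0.05]
attended = [a >= 0.25 for a in attention]
plant_patches = [c > 0 for c in coverage]
background_patches = [not p for p in plant_patches]

iou_plant = iou(attended, plant_patches)            # overlap with plant patches
iou_background = iou(attended, background_patches)  # overlap with background patches
```

In this sketch a weight above 1 amplifies a patch token and a weight below 1 attenuates it. A model that attends more to plant organs would raise `iou_plant` and lower `iou_background`, which is the direction of change the abstract reports for IoUₚ and IoUb.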
Sedrat et al. studied this question.