What question did this study set out to answer?

This research aims to improve automated plant trait recognition in herbarium images by reducing reliance on distracting background elements.

March 16, 2026 · Open Access

AT‐ViT: Area‐Targeted Multi‐View Vision Transformer With Cross‐Attention and Multi‐Scale Patching for Plant Trait Recognition in Herbarium Images

Key Points

The authors developed AT-ViT, a dual-branch vision transformer.
The model combines raw scans and their segmentation-derived counterparts through multi-scale, multi-view cross-attention fusion.
A mask-guided patch weighting mechanism amplifies plant-relevant regions and attenuates background-driven features.
Across multiple trait classification tasks, AT-ViT consistently improved accuracy and attention localization on plant regions.
Relative to CrossViT, plant-region alignment (Avg IoU_p) rose by 15.66 to 18.03 percentage points and background overlap (Avg IoU_b) fell by 27.92 to 31.02 percentage points.
Under background-noise conditions, AT-ViT outperformed ResNet101 by up to 32.32 accuracy points and CrossViT by up to 5.07 points.
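
The mask-guided patch weighting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the weighting formula, so the linear rule below, the `floor` parameter, and the function names `patch_coverage`, `patch_weights`, and `reweight_tokens` are all assumptions made for illustration. The sketch computes how much of each patch is covered by the plant mask and scales the corresponding patch embedding accordingly.

```python
def patch_coverage(mask, patch):
    """Fraction of plant pixels (mask value 1) in each non-overlapping patch."""
    rows, cols = len(mask), len(mask[0])
    if rows % patch or cols % patch:
        raise ValueError("mask size must be a multiple of the patch size")
    coverage = []
    for top in range(0, rows, patch):
        row = []
        for left in range(0, cols, patch):
            plant = sum(
                mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            row.append(plant / (patch * patch))
        coverage.append(row)
    return coverage


def patch_weights(coverage, floor=0.2):
    """Map coverage in [0, 1] to a weight in [floor, 1].

    Assumed rule (not from the paper): pure-background patches are scaled
    down to `floor`, fully plant-covered patches keep weight 1.
    """
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must lie in [0, 1]")
    return [[floor + (1.0 - floor) * c for c in row] for row in coverage]


def reweight_tokens(tokens, weights):
    """Scale each patch embedding (row-major patch order) by its weight."""
    flat = [w for row in weights for w in row]
    if len(flat) != len(tokens):
        raise ValueError("need exactly one weight per patch token")
    return [[w * x for x in token] for w, token in zip(flat, tokens)]
```

For a 4×4 mask split into 2×2 patches, a patch lying entirely on the plant keeps its embedding unchanged, while a patch containing only background is scaled by `floor`. A non-zero floor is a design choice in this sketch: it attenuates background features without discarding them entirely, in keeping with the description of learning from the original scans while being guided by the masks.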

Abstract

Automated plant trait recognition from herbarium images is essential for plant sciences, yet it remains challenging because background elements (e.g., textual labels, mounting artefacts and colour charts) can introduce shortcut learning, leading models to rely on spurious nonplant cues rather than plant morphology. This bias degrades both generalisation and interpretability. In this paper, we introduce AT‐ViT, a dual‐branch vision transformer that jointly encodes raw herbarium scans and their segmentation‐derived counterparts via a multi‐scale, multi‐view cross‐attention fusion scheme. AT‐ViT further incorporates a mask‐guided patch weighting mechanism that amplifies plant‐relevant regions and attenuates background‐driven features. By learning from the original scans while being guided by segmentation masks through this reweighting mechanism, the model is encouraged to focus on plant organs and learn plant‐centric representations more effectively. Across multiple trait classification tasks (e.g., leaf base shape, thorns), AT‐ViT delivers consistent accuracy gains, improves attention localisation on plant regions and exhibits increased robustness under synthetic background perturbations. Specifically, AT‐ViT substantially improves spatial attention grounding, boosting plant‐region alignment (Avg IoU_p: +15.66 to +18.03 pp) while reducing background overlap (Avg IoU_b: −27.92 to −31.02 pp) relative to CrossViT, and remains markedly more robust to background perturbations, outperforming ResNet101 by up to +32.32 accuracy points and CrossViT by up to +5.07 points under background‐noise conditions.
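
To make the IoU_p and IoU_b figures concrete, here is a minimal sketch of how attention grounding could be scored against a segmentation mask. The abstract does not describe the evaluation protocol, so the fixed threshold used to binarise the attention map and the helper names `binarize`, `iou`, and `attention_iou` are assumptions, not the authors' method. The sketch measures overlap between high-attention cells and the plant region (IoU_p) and between high-attention cells and the background (IoU_b).

```python
def binarize(attention, threshold=0.5):
    """Mark cells whose attention is at or above the (assumed) threshold."""
    return [[1 if a >= threshold else 0 for a in row] for row in attention]


def iou(a, b):
    """Intersection over union of two binary grids of equal shape."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def attention_iou(attention, plant_mask, threshold=0.5):
    """Return (IoU_p, IoU_b): overlap with plant and with background."""
    hot = binarize(attention, threshold)
    background = [[1 - m for m in row] for row in plant_mask]
    return iou(hot, plant_mask), iou(hot, background)


# Toy 2x3 attention map: most attention falls on the plant (left columns).
attention = [[0.9, 0.8, 0.1],
             [0.7, 0.2, 0.1]]
plant_mask = [[1, 1, 0],
              [1, 1, 0]]
iou_p, iou_b = attention_iou(attention, plant_mask)
```

Under this reading, a model with better grounding shows a higher IoU_p and a lower IoU_b, which matches the direction of the changes reported relative to CrossViT.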

Cite This Study

Sedrat et al. studied this question.

synapsesocial.com/papers/69b79e7c8166e15b153abd25 https://doi.org/10.1049/cvi2.70059
