In the context of global food security and sustainable agricultural development, the efficient recognition and precise management of agricultural insect pests and their predators have become critical challenges in the domain of smart agriculture. To address the limitations of traditional models that overly rely on single-modal inputs and suffer from poor recognition stability under complex field conditions, a multimodal recognition framework has been proposed. This framework integrates RGB imagery, thermal infrared imaging, and environmental sensor data. A cross-modal attention mechanism, environment-guided modality weighting strategy, and decoupled recognition heads are incorporated to enhance the model’s robustness against small targets, intermodal variations, and environmental disturbances. Evaluated on a high-complexity multimodal field dataset, the proposed model significantly outperforms mainstream methods across four key metrics, precision, recall, F1-score, and mAP@50, achieving 91.5% precision, 89.2% recall, 90.3% F1-score, and 88.0% mAP@50. These results represent an improvement of over 6% compared to representative models such as YOLOv8 and DETR. Additional ablation studies confirm the critical contributions of key modules, particularly under challenging scenarios such as low light, strong reflections, and sensor data noise. Moreover, deployment tests conducted on the Jetson Xavier edge device demonstrate the feasibility of real-world application, with the model achieving a 25.7 FPS inference speed and a compact size of 48.3 MB, thus balancing accuracy and lightweight design. This study provides an efficient, intelligent, and scalable AI solution for pest surveillance and biological control, contributing to precision pest management in agricultural ecosystems.
Liu et al. (Sat,) studied this question.