What question did this study set out to answer?

The aim is to develop an effective method for pest recognition that integrates environmental data with visual inputs.

June 10, 2026 | Open Access

Multimodal Method for Pest Recognition Using Field Images and Environmental Data in Smart Agriculture

Key Points

The aim is to develop an effective method for pest recognition that integrates environmental data with visual inputs.
A multimodal self-supervised pretraining framework involving field images and environmental sensor data was designed.
Features from images and environmental data were extracted and aligned using contrastive consistency and temporal consistency modules.
The approach included a collaborative fusion module to integrate information from various data sources.
The proposed method achieved Accuracy, Precision, Recall, and F1-score of 94.37%, 93.96%, 93.42%, and 93.69%, respectively.
The F1-score dropped to 91.00%, 91.36%, and 90.49% when the cross-modal contrastive consistency, temporal consistency, and collaborative fusion modules were removed, respectively, highlighting the contribution of each module.
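
The reported F1-score can be sanity-checked against the reported Precision and Recall, because F1 is their harmonic mean. The Python sketch below is illustrative only. It assumes the three metrics share the same averaging basis, which this summary does not state, and the function and variable names are hypothetical rather than taken from the study. It also computes how far the F1-score falls in each ablation setting.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the Key Points, in percent.
reported = {"precision": 93.96, "recall": 93.42, "f1": 93.69}
recomputed_f1 = f1_from_precision_recall(reported["precision"], reported["recall"])

# F1-score reported after removing each module, in percent.
ablation_f1 = {
    "without cross-modal contrastive consistency": 91.00,
    "without temporal consistency": 91.36,
    "without collaborative fusion": 90.49,
}

# Drop in percentage points relative to the full model.
f1_drop = {
    name: round(reported["f1"] - value, 2) for name, value in ablation_f1.items()
}

if __name__ == "__main__":
    print(f"recomputed F1: {recomputed_f1:.2f}")
    for name, drop in f1_drop.items():
        print(f"{name}: -{drop:.2f} points")
```

Under that assumption the recomputed value agrees with the reported F1-score to two decimal places, and the largest drop corresponds to removing the collaborative fusion module.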

Abstract

Accurate pest recognition is an important foundation for intelligent plant protection, precision pesticide application, and sustainable agricultural management. However, in real field environments, pest targets are often small in scale, severely occluded, and embedded in complex backgrounds, which limits the performance of existing supervised learning methods under low-annotation and cross-scenario conditions. To address these issues, a multimodal self-supervised pretraining framework is proposed for pest recognition, in which field pest images and environmental sensor data are integrated to construct pest representations with environmental awareness. In this framework, image features, including pest morphology, leaf texture, and damaged regions, are first extracted through a visual encoding branch, while temporal variation features of ecological factors, including temperature, humidity, illumination, soil moisture, rainfall, and wind speed, are modeled through an environmental encoding branch. On this basis, a cross-modal contrastive consistency module is designed to align visual and environmental representations, a temporal consistency self-supervised module is introduced to characterize the continuous evolutionary relationship between pest occurrence and environmental changes, and a multimodal collaborative representation fusion module is constructed to adaptively integrate information from different modalities. The experimental results show that the proposed method achieves favorable performance in the pest recognition task, with Accuracy, Precision, Recall, and F1-score reaching 94.37%, 93.96%, 93.42%, and 93.69%, respectively, outperforming ConvNeXtV2-T, ViT-B/16, Swin-T, SimCLR, MAE, and the conventional Image + Sensor fusion method. 
The ablation experiments further show that removing, one at a time, the cross-modal contrastive consistency module, the temporal consistency self-supervised module, and the multimodal collaborative fusion module reduces the F1-score to 91.00%, 91.36%, and 90.49%, respectively, demonstrating the contribution of each module. This study provides a viable multimodal self-supervised learning approach for AI-driven intelligent pest recognition, early warning, and precision control in agriculture.
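
The abstract does not specify the loss used by the cross-modal contrastive consistency module. One common way to align paired embeddings from two modalities is a symmetric InfoNCE objective, in which each image embedding should be most similar to the environmental embedding recorded with it. The Python sketch below illustrates that general idea under stated assumptions: the loss form, the temperature value of 0.07, and all names are hypothetical and are not taken from the study.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(
    image_emb: Sequence[Vector],
    env_emb: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not from the study
) -> float:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb[i] and env_emb[i] are assumed to come from the same field
    sample; every other pairing in the batch is treated as a negative.
    """
    if len(image_emb) != len(env_emb) or len(image_emb) == 0:
        raise ValueError("both batches must be non-empty and the same size")
    img = [l2_normalize(v) for v in image_emb]
    env = [l2_normalize(v) for v in env_emb]
    # logits[i][j]: scaled cosine similarity between image i and environment j
    logits = [
        [sum(a * b for a, b in zip(i_vec, e_vec)) / temperature for e_vec in env]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # environment-to-image view
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy batch: two image embeddings and their matching environmental embeddings.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_env = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = symmetric_info_nce(toy_images, toy_env)
# Swapping the environmental embeddings breaks the pairing and raises the loss.
loss_shuffled = symmetric_info_nce(toy_images, [toy_env[1], toy_env[0]])
```

In this toy batch the correctly paired embeddings yield a lower loss than the shuffled ones, which is the behavior a contrastive alignment objective is meant to encourage. In practice the embeddings would come from the visual and environmental encoding branches, and the loss would be computed with a tensor library over much larger batches.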
