Accurate pest recognition underpins intelligent plant protection, precision pesticide application, and sustainable agricultural management. In real field environments, however, pest targets are often small, severely occluded, and embedded in complex backgrounds, which limits the performance of existing supervised learning methods when annotations are scarce or when the deployment scenario differs from the training scenario. To address these issues, a multimodal self-supervised pretraining framework is proposed for pest recognition, in which field pest images and environmental sensor data are integrated to construct environment-aware pest representations. In this framework, a visual encoding branch extracts image features, including pest morphology, leaf texture, and damaged regions, while an environmental encoding branch models the temporal variation of ecological factors, including temperature, humidity, illumination, soil moisture, rainfall, and wind speed. On this basis, a cross-modal contrastive consistency module aligns visual and environmental representations, a temporal consistency self-supervised module characterizes the continuous evolutionary relationship between pest occurrence and environmental changes, and a multimodal collaborative representation fusion module adaptively integrates information from the different modalities. Experimental results show that, on the pest recognition task, the proposed method reaches an Accuracy, Precision, Recall, and F1-score of 94.37%, 93.96%, 93.42%, and 93.69%, respectively, outperforming ConvNeXtV2-T, ViT-B/16, Swin-T, SimCLR, MAE, and the conventional Image + Sensor fusion method.
The ablation experiments further show that removing the cross-modal contrastive consistency module, the temporal consistency self-supervised module, or the multimodal collaborative fusion module lowers the F1-score from 93.69% to 91.00%, 91.36%, and 90.49%, respectively (drops of 2.69, 2.33, and 3.20 percentage points), indicating that each module contributes to the final performance and that the fusion module contributes the most. This study provides a viable multimodal self-supervised learning approach for AI-driven intelligent pest recognition, early warning, and precision control in agriculture.
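To make the idea of cross-modal contrastive alignment concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the loss, so a symmetric InfoNCE objective of the kind commonly used for paired modalities is assumed here. The function names, the temperature of 0.07, and the toy embeddings are all hypothetical. Row i of the image embeddings and row i of the environmental embeddings are assumed to come from the same field sample.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`."""
    top = max(scores)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target] - log_sum

def cross_modal_contrastive_loss(img_emb, env_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/environment embeddings.

    Assumption: pair (i, i) is a positive, every (i, j != i) is a negative.
    """
    img = [normalize(v) for v in img_emb]
    env = [normalize(v) for v in env_emb]
    n = len(img)
    # Cosine similarity between every image and every environment vector.
    logits = [
        [sum(a * b for a, b in zip(img[i], env[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image -> environment direction: classify the matching column per row.
    i2e = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Environment -> image direction: classify the matching row per column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    e2i = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (i2e + e2i)

# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing swapped.
paired_images = [[1.0, 0.0], [0.0, 1.0]]
matched_env = [[1.0, 0.0], [0.0, 1.0]]
swapped_env = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = cross_modal_contrastive_loss(paired_images, matched_env)
loss_swapped = cross_modal_contrastive_loss(paired_images, swapped_env)
```

In a real pipeline the two inputs would be the outputs of the visual and environmental encoding branches for one batch, and the loss would be minimized jointly with the other self-supervised objectives.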
Xiao et al. (Mon,) studied this question.