• We propose a plug-and-play global and local collaborative fusion method to improve the performance of weakly supervised object detection.
• We design a pixel-level global information awareness module that uses singular value decomposition for image reconstruction.
• We propose a local detail fusion module that enables the visual encoder to learn detailed information about target objects.
• We demonstrate the effectiveness and superiority of our plug-and-play method through extensive experiments.

Weakly supervised object detection (WSOD) has drawn much attention because of its closeness to practical applications, and it is commonly addressed with multi-instance learning (MIL), which treats detection as a multi-class classification problem. Although MIL-based methods have yielded promising results, the lack of instance-level annotation means that extraneous information in the images severely affects the model's feature learning. To alleviate this limitation, we propose a global and local collaborative fusion method for WSOD that leverages the complementary information of the original image and its low-rank approximation. Specifically, we design a pixel-level global information awareness (GIA) module that reconstructs the input image and removes redundant noise; the reconstructed image is then fed into a visual encoder to extract features from a global perspective. Moreover, to compensate for the loss of detail in GIA, we further propose a local detail fusion (LDF) module that fuses image details by leveraging both the reconstructed and the input images. The proposed GIA-LDF modules are architecture-agnostic and can be embedded into any MIL-based WSOD pipeline. Extensive experiments validate the effectiveness of our plug-and-play GIA-LDF for WSOD. We achieve 60.2%, 57.4%, and 23.2% mAP on PASCAL VOC 2007, VOC 2012, and COCO, respectively, surpassing baseline methods by +2.0%, +1.2%, and +0.3% and establishing new state-of-the-art performance on all three benchmarks.
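To make the idea of a low-rank approximation of the input image concrete, the following is a minimal sketch of a truncated-SVD reconstruction in NumPy. It is not the GIA module itself: the function name `low_rank_reconstruct`, the per-channel decomposition, and the `rank` parameter are assumptions made for illustration, since the abstract does not specify how the decomposition is applied or how many singular values are kept.

```python
import numpy as np


def low_rank_reconstruct(image, rank):
    """Rank-limited reconstruction of an image via truncated SVD.

    Assumption (not from the paper): each channel is decomposed
    independently and only the `rank` largest singular values are kept.
    Accepts an H x W or H x W x C array and returns the same shape.
    """
    image = np.asarray(image, dtype=np.float64)
    is_gray = image.ndim == 2
    if is_gray:
        image = image[:, :, None]

    channels = []
    for c in range(image.shape[2]):
        # Thin SVD: channel = u @ diag(s) @ vt, singular values sorted descending.
        u, s, vt = np.linalg.svd(image[:, :, c], full_matrices=False)
        k = min(rank, s.size)
        # Keep the k leading components; the discarded tail holds fine detail and noise.
        channels.append((u[:, :k] * s[:k]) @ vt[:k, :])

    out = np.stack(channels, axis=2)
    return out[:, :, 0] if is_gray else out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 48, 3))          # stand-in for a real RGB image
    recon = low_rank_reconstruct(img, rank=8)
    residual = img - recon                 # detail removed by the approximation
    print(recon.shape, float(np.abs(residual).mean()))
```

The residual `img - recon` is the detail that a low-rank reconstruction discards, which is the kind of information the abstract says the LDF module is meant to recover by drawing on both the reconstructed and the input images.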
Liang et al. (Sun,) studied this question.