What type of study is this?

This is a Quantitative Study study.

October 16, 2025

Hierarchical Multimodal Knowledge Matching for Training-Free Open-Vocabulary Object Detection

Key Points

The hierarchical multimodal knowledge matching method improves detection results for novel categories.
Using limited category-specific images, it builds object and attribute prototype knowledge for better representation.
The proposed training-free approach allows seamless integration into existing open-vocabulary object detection models.
Extensive experiments verify significant performance improvements across various datasets and model architectures.

Abstract

Open-Vocabulary Object Detection (OVOD) aims to leverage the generalization capabilities of pre-trained vision-language models for detecting objects beyond the trained categories. Existing methods mostly focus on supervised learning strategies based on available training data, which might be suboptimal for data-limited novel categories. To tackle this challenge, this paper presents a Hierarchical Multimodal Knowledge Matching method (HMKM) to better represent novel categories and match them with region features. Specifically, HMKM includes a set of object prototype knowledge that is obtained using limited category-specific images, acting as off-the-shelf category representations. In addition, HMKM also includes a set of attribute prototype knowledge to represent key attributes of categories at a fine-grained level, with the goal to distinguish one category from its visually similar ones. During inference, two sets of object and attribute prototype knowledge are adaptively combined to match categories with region features. The proposed HMKM is training-free and can be easily integrated as a plug-and-play module into existing OVOD models. Extensive experiments demonstrate that our HMKM significantly improves the performance when detecting novel categories across various backbones and datasets.

اسأل الذكاء الاصطناعي

Bookmark

اسأل الذكاء الاصطناعي

Bookmark

Hierarchical Multimodal Knowledge Matching for Training-Free Open-Vocabulary Object Detection

Key Points

Abstract

Cite This Study