Open-vocabulary object detection (OVOD) aims to localize and recognize objects in images from category-specific textual inputs that cover both known and novel categories. While existing methods perform well in general scenarios, their performance deteriorates significantly in domain-specific fine-grained detection because they rely heavily on high-quality textual descriptions. In specialized domains, such descriptions are often affected by newly introduced terms or subjective human bias, which limits their applicability. In this paper, we propose an attribute decomposition–aggregation approach for OVOD to address these challenges. By decomposing categories into fine-grained attributes and learning them in a multi-label manner, our method mitigates the text quality issues caused by novel terms and human bias. During inference, the text of an unseen fine-grained category can be represented for detection by combining the decomposed attributes. Even when a model learns the attributes, however, current methods make insufficient use of textual attributes. To mitigate this limitation, we propose an attribute-aggregation module that enhances discriminative capability by emphasizing the attributes that are critical for distinguishing target objects from foreground elements. We evaluate our OVOD framework on both our newly constructed military dataset and the public LAD dataset. Experimental results show that our method outperforms existing methods in domain-specific fine-grained open-vocabulary detection tasks.
Dou et al. studied this question.
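The idea of representing an unseen category by combining decomposed attributes can be illustrated with a minimal sketch. This is not the method from the paper: the attribute names, the toy 3-dimensional embeddings, the weighted-mean aggregation, and the cosine scoring are all assumptions chosen for illustration. A real system would obtain attribute embeddings from a text encoder and learn the attribute weights.

```python
import math

# Hypothetical toy attribute embeddings; a real system would take these
# from a text encoder rather than hand-written vectors.
ATTRIBUTE_EMBEDDINGS = {
    "has_wings": [1.0, 0.0, 0.0],
    "has_wheels": [0.0, 1.0, 0.0],
    "is_metallic": [0.0, 0.0, 1.0],
}


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aggregate_attributes(names, weights):
    """Build a category embedding as a weighted mean of attribute embeddings.

    Larger weights emphasize the attributes assumed to be more discriminative.
    """
    vectors = [ATTRIBUTE_EMBEDDINGS[n] for n in names]
    total = sum(weights)
    dim = len(vectors[0])
    mean = [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(region_feature, categories):
    """Score a region feature against every attribute-composed category."""
    scores = {
        name: cosine(region_feature, aggregate_attributes(attrs, weights))
        for name, (attrs, weights) in categories.items()
    }
    return max(scores, key=scores.get), scores


# Two hypothetical unseen categories, each described only by attributes.
categories = {
    "category_a": (["has_wings", "is_metallic"], [1.0, 2.0]),
    "category_b": (["has_wheels", "is_metallic"], [1.0, 1.0]),
}
region_feature = [0.4, 0.05, 0.9]  # toy feature for one detected region
best, scores = classify(region_feature, categories)
```

In this toy setup, neither category name is ever embedded directly; each is described only through its attributes, so a category with a newly coined name can still be scored as long as its attributes are known.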