June 16, 2024

YOLO-World: Real-Time Open-Vocabulary Object Detection

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the finetuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation. Code and models are available at: https://github.com/AILab-eve/YOLO-World.
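As a rough illustration of the region-text contrastive idea named in the abstract, the sketch below scores each region embedding against a set of text embeddings by cosine similarity and applies a softmax cross-entropy toward the assigned text. This is a generic contrastive formulation, not the authors' exact loss: the function names, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_logits(region_embs, text_embs, temperature=0.1):
    """Cosine similarity of every region to every text, divided by a temperature.

    `temperature` is a hypothetical hyperparameter chosen for this sketch.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]


def region_text_contrastive_loss(region_embs, text_embs, targets, temperature=0.1):
    """Mean cross-entropy between region-text similarities and assigned text indices."""
    if not region_embs:
        raise ValueError("at least one region embedding is required")
    logits = similarity_logits(region_embs, text_embs, temperature)
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary for this region.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def classify_regions(region_embs, text_embs):
    """Zero-shot style assignment: pick the most similar text for each region."""
    logits = similarity_logits(region_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy 2-D embeddings (made up): region 0 aligns with text 0, region 1 with text 1.
toy_regions = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[2.0, 0.0], [0.0, 3.0]]
aligned_loss = region_text_contrastive_loss(toy_regions, toy_texts, [0, 1])
swapped_loss = region_text_contrastive_loss(toy_regions, toy_texts, [1, 0])
predicted = classify_regions(toy_regions, toy_texts)
```

Because the vocabulary enters only as a list of text embeddings, changing the set of detectable categories in this sketch means changing `text_embs`, which is the property that makes an open-vocabulary detector usable without retraining on a fixed label set.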

Authors

Tianheng Cheng

Lin Song

Yixiao Ge

Institutions

Huazhong University of Science and Technology

Tencent (China)
