Robotic vision modules in low-volume, high-mixture scenarios frequently need to be adapted to new requirements, but retraining or fine-tuning well-established object detection models may be too slow and resource-intensive. Open-vocabulary object detection is a promising alternative, and fine-tuning the prompt embeddings can handle cases where text prompts alone are not sufficient. We propose coupling this so-called prompt tuning with a vector database that retrieves the best prompts for differentiating challenging scenarios. This enables iterative improvement with little impact on the model’s performance on previous requirements. We implemented the proposed method by adapting Grounding DINO and experimentally verified its effectiveness on the LVIS benchmark.
Fixl et al. studied this question.
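The retrieval idea described above can be illustrated with a small sketch. It is not the authors' implementation and does not use Grounding DINO: the `PromptStore` class, the choice of a scene feature vector as the lookup key, cosine similarity as the ranking metric, and all numeric values are assumptions made purely for illustration. The sketch only shows how tuned prompt embeddings could be stored under a key and how the closest one could be fetched at inference time.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class PromptStore:
    """Toy in-memory stand-in for a vector database of tuned prompts.

    Hypothetical design: each entry maps a key vector (assumed here to be
    some feature of the scene) to a label and a tuned prompt embedding.
    """
    entries: list = field(default_factory=list)

    def add(self, key, label, prompt_embedding):
        # Appending never modifies the entries that are already stored.
        self.entries.append((list(key), label, list(prompt_embedding)))

    def query(self, key, top_k=1):
        # Rank stored entries by similarity of their key to the query key.
        ranked = sorted(self.entries, key=lambda e: cosine(key, e[0]), reverse=True)
        return [(label, emb) for _, label, emb in ranked[:top_k]]


store = PromptStore()
store.add([1.0, 0.0, 0.0], "bolt", [0.2, 0.9])
store.add([0.0, 1.0, 0.0], "washer", [0.7, 0.1])
# A prompt tuned later for a new requirement is simply added as a new entry.
store.add([0.9, 0.1, 0.0], "screw", [0.3, 0.8])

best = store.query([0.1, 0.95, 0.0], top_k=1)
print(best[0][0])
```

In this toy setup, supporting a new requirement only appends an entry, which loosely mirrors the claim that earlier requirements are left largely unaffected. A real system would presumably replace the linear scan with an approximate nearest-neighbour index.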