We propose a novel task, Virtual Object Grounding (VOG), which aims to predict plausible locations in an image for inserting virtual objects that match a given textual description. VOG addresses the challenge of providing region constraints for object insertion in image editing, so that areas of the image unrelated to the edit remain unchanged. To support this task, we construct the Virtual Segmentation dataset (VirtualSeg), a dataset of over 92,000 samples automatically generated from VrR-VG via a four-step construction pipeline. The pipeline uses CLIP to automatically filter out low-quality samples, ensuring the quality of VirtualSeg. Furthermore, we propose VirLLaVA, a novel virtual object grounding framework built upon LLaVA-7B. By equipping the MLLM backbone with two sequences of learnable tokens and a dual grounding module, and by guiding the model during training to learn step by step how to locate virtual objects, our method enables it to reason about their positions from textual and visual inputs. Experiments show that VirLLaVA significantly improves performance in virtual object grounding, while also offering a promising direction for consistent and automated image editing. The code and dataset are available at https://github.com/Royxia0818/MLLMforVOG.
Xia et al. studied this question.
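The CLIP-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual filtering criterion, so everything here is an assumption: the sketch supposes that each sample has an image embedding and a text embedding (which in practice would come from CLIP's image and text encoders), and that a sample is kept when their cosine similarity reaches a threshold. The names `cosine_similarity`, `filter_samples`, the toy vectors, and the threshold value `0.25` are all hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import List, Sequence, Tuple

# A sample is (sample_id, image_embedding, text_embedding).
# In a real pipeline the embeddings would be produced by CLIP;
# here they are toy vectors used only for illustration.
Sample = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def filter_samples(samples: List[Sample], threshold: float = 0.25) -> List[str]:
    """Keep the ids of samples whose image-text similarity is >= threshold.

    The threshold is a hypothetical value, not one reported by the authors.
    """
    kept = []
    for sample_id, image_emb, text_emb in samples:
        if cosine_similarity(image_emb, text_emb) >= threshold:
            kept.append(sample_id)
    return kept


# Toy data: one well-aligned pair and one orthogonal (mismatched) pair.
toy_samples: List[Sample] = [
    ("good", [1.0, 0.0, 1.0], [1.0, 0.1, 0.9]),
    ("bad", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
kept_ids = filter_samples(toy_samples, threshold=0.25)
print(kept_ids)
```

With these toy vectors the mismatched pair has similarity 0 and is dropped, while the aligned pair is kept. A real pipeline would need a threshold tuned on the data, and the authors' four-step procedure may use CLIP quite differently.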