What question did this study set out to answer?

The research aims to create a framework for accurately positioning virtual objects in images based on textual descriptions.

February 25, 2026

Multimodal Large Language Model for Virtual Object Grounding

Key Points

The research aims to create a framework for accurately positioning virtual objects in images based on textual descriptions.
Introduced a new task called Virtual Object Grounding (VOG).
Constructed the Virtual Segmentation dataset (VirtualSeg) with over 92,000 samples.
Developed the VirLLaVA model utilizing learnable tokens and a dual grounding module.
Employed a four-step dataset construction pipeline using CLIP for quality control.
VirLLaVA significantly improves virtual object grounding performance.
Enabled reasoning about object positions from both textual and visual inputs.
Demonstrated potential for consistent and automated image editing.

Abstract

We propose a novel task, Virtual Object Grounding (VOG). It aims to predict plausible locations in an image for inserting virtual objects that align with a given textual description. The VOG task addresses the challenge of providing region constraints for object insertion in image editing, thereby keeping irrelevant areas of the image consistent. To support this task, we construct the Virtual Segmentation dataset (VirtualSeg), a dataset of over 92,000 samples automatically generated from VrR-VG via a four-step dataset construction pipeline. This pipeline employs CLIP to automatically filter out low-quality data samples, ensuring the quality of VirtualSeg. Furthermore, we propose the VirLLaVA model, a novel virtual object grounding framework built upon LLaVA-7B. By equipping the MLLM backbone with two sequences of learnable tokens and a dual grounding module, and by guiding the model during training to learn step by step how to locate virtual objects, our method enables it to reason about their positions from textual and visual inputs. Experiments show that VirLLaVA significantly improves performance in virtual object grounding, while also offering a promising direction for consistent and automated image editing. The code and dataset are available at https://github.com/Royxia0818/MLLMforVOG.
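The abstract does not say how CLIP scores are used to filter samples, so the following is only a minimal sketch of one common approach: keep a sample when the cosine similarity between its image embedding and its text embedding meets a threshold. The function names, the `image_emb` and `text_emb` keys, and the threshold of 0.25 are hypothetical, and the embeddings are assumed to have been computed beforehand by a CLIP model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_samples(samples: List[Dict], threshold: float = 0.25) -> List[Dict]:
    """Keep samples whose image/text embedding similarity meets the threshold.

    Assumption: each sample carries precomputed embeddings under the
    hypothetical keys 'image_emb' and 'text_emb'. The threshold is
    illustrative and is not taken from the paper.
    """
    return [
        s for s in samples
        if cosine_similarity(s["image_emb"], s["text_emb"]) >= threshold
    ]


# Toy embeddings standing in for real CLIP outputs.
samples = [
    {"id": "aligned", "image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},
    {"id": "mismatched", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]
kept_ids = [s["id"] for s in filter_samples(samples)]
```

With these toy vectors, the aligned pair is kept and the orthogonal pair is dropped. A real pipeline would replace the toy vectors with embeddings of the cropped object region and its description, and would tune the threshold on held-out data.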
