What question did this study set out to answer?

To develop a framework that enables part-level 3D editing guided by text prompts, where current methods are largely limited to global style transfer or constrained geometric deformation.

April 4, 2026 (Open Access)

A Diffusion Framework based on Prompt-Driven Masks for Part-level 3D Editing

Key Points

Proposed the Mask2-3D framework for part-level 3D editing
Utilized a learnable, multi-view mask generator for editing region prediction
Integrated a finetuned diffusion model for content synthesis and style transfer
Implemented a re-rendering process for maintaining multi-view consistency
Enabled precise and flexible local editing of 3D models
Facilitated significant geometric modifications and new shape architectures
Enhanced the intuitiveness of 3D content creation through natural language commands
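
The key points describe edits confined to a predicted region in each view. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes the masks and the edited renders are already available and simply composites them per view. The function name `composite_views` and the array shapes are hypothetical choices made for illustration.

```python
import numpy as np

def composite_views(original, edited, masks):
    """Blend edited content into the original renders, view by view.

    original, edited: arrays of shape (V, H, W, 3) with values in [0, 1]
    masks: array of shape (V, H, W) with values in [0, 1];
           1 marks the editing region, 0 keeps the original pixel.
    (Shapes and names are assumptions for this sketch.)
    """
    original = np.asarray(original, dtype=np.float32)
    edited = np.asarray(edited, dtype=np.float32)
    # Add a channel axis so the mask broadcasts over RGB.
    m = np.clip(np.asarray(masks, dtype=np.float32), 0.0, 1.0)[..., None]
    return m * edited + (1.0 - m) * original

# Toy data: two 4x4 views, black originals, white edits,
# and a 2x2 editing region in the centre of each view.
views = np.zeros((2, 4, 4, 3), dtype=np.float32)
edits = np.ones((2, 4, 4, 3), dtype=np.float32)
masks = np.zeros((2, 4, 4), dtype=np.float32)
masks[:, 1:3, 1:3] = 1.0

out = composite_views(views, edits, masks)
```

Pixels outside the mask are passed through unchanged, which is what keeps an edit local. A soft mask (values between 0 and 1) would feather the boundary.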

Abstract

Advances in generative artificial intelligence (AI), particularly diffusion models and 3D Gaussian Splatting (3DGS), have opened new avenues for manipulating and synthesizing 3D models. However, current 3D editing methods primarily focus on global style transfer or constrained geometric deformation. They struggle with fine-grained, part-level manipulations guided by text prompts, especially for complex tasks that require simultaneous changes to both geometry and appearance. Many existing approaches operate at the rendering level, which hinders the creation of new geometric structures. To overcome these limitations, we propose Mask2-3D, a diffusion-based framework for prompt-driven, part-level 3D editing. The core of our framework is a learnable, multi-view mask generator that predicts a coherent editing region rather than just segmenting existing contours. This mechanism provides the flexibility to create new shape architectures and to make significant geometric modifications. Furthermore, the system integrates a LoRA-finetuned diffusion model for high-fidelity content synthesis and style transfer within these designated regions, while a subsequent re-rendering process ensures multi-view consistency. With this workflow, Mask2-3D enables precise, flexible, and structurally sound local editing of 3D models via natural language commands, making 3D content creation more intuitive and giving creators more freedom.
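
The abstract mentions a LoRA-finetuned diffusion model without giving details. As a reminder of what LoRA does in general, here is a minimal sketch of a low-rank adapter on a single linear layer. It is not taken from the paper: the dimensions, the scaling `alpha / r`, and the names `lora_forward` and `merge_lora` are assumptions chosen for illustration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Linear layer with a low-rank adapter.

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    Output: x W^T + (alpha / r) * x A^T B^T
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def merge_lora(W, A, B, alpha):
    """Fold the adapter into a single weight matrix for inference."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2            # hypothetical sizes
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))            # zero init: adapter starts as a no-op
x = rng.standard_normal((3, d_in))

y0 = lora_forward(x, W, A, B, alpha=4.0)
```

Only `A` and `B` are trained, so the number of tuned parameters is `r * (d_in + d_out)` rather than `d_in * d_out`, which is why this kind of finetuning is comparatively cheap.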
