What question did this study set out to answer?

The aim is to improve dynamic video reconstruction from monocular views by addressing motion ambiguity and structural coherence.

January 24, 2026

Hie4DGS: Hierarchical 4D Gaussian Splatting from Monocular Dynamic Video.

Key Points

The aim is to improve dynamic video reconstruction from monocular views by addressing motion ambiguity and structural coherence.
Developed Hie4DGS, a hierarchical structure representation for dynamic motions.
Decomposed motions into groups of varying granularities for better modeling.
Utilized hierarchical semantic segmentation for Gaussian grouping.
Introduced a structure rendering loss to ensure consistency with semantic priors.
Achieved significant improvement in rendering quality compared to state-of-the-art methods.
Enhanced accuracy in modeling complex real-world motions from monocular video datasets.

Abstract

Monocular dynamic video reconstruction is a typical ill-posed problem due to the limited observations and complex 3D motions. Despite the recent advances in dynamic 3D Gaussian splatting techniques, most of them still struggle with the monocular setting, since they heavily rely on geometric cues from multiple cameras or ignore the structural coherence among the optimized 3D Gaussains. To address this, we propose Hie4DGS, a novel hierarchical structure representation to model the complex dynamic motions from monocular dynamic videos. Specifically, we decompose the motions of a dynamic scene into groups of multiple structure granularities and progressively compose them to derive the motion of each 3D Gaussian. Building on this representation, we leverage hierarchical semantic segmentation to group Gaussians and initialize their motion using depth and tracking priors within each group. Additionally, we introduce a structure rendering loss that enforces consistency between the learned motion structure and semantic priors, further reducing motion ambiguity. Compared to the state-of-the-art dynamic Gaussian methods, we achieve significant improvement in rendering quality on monocular video datasets featuring complex real-world motions.

اسأل الذكاء الاصطناعي

Bookmark