August 12, 2025

Unbiased Embodied Visual Representation Learning with Causal Inference and Cross-Modality Alignment

Key Points

The proposed UEVR model reduces perception bias in object goal navigation strategies, improving generalization in novel settings.
By employing causal inference, the model mitigates spurious association bias from semantic distributions in visual observations.
Cross-modality alignment integrates 3D geometry with 2D representations to counteract dynamic-view biases in visual data.
Extensive experiments on the MP3D and HM3D datasets validate the improved performance of the Causal-ObjectNav framework.

Abstract

Object Goal Navigation (ObjectNav) in novel environments relies on comprehensive scene understanding, including precise visual perception and accurate modeling of spatial-semantic regularities. However, prevailing approaches focus heavily on hand-crafted scene representations and neglect the negative influence of the perception bias hidden in visual observations. The hand-crafted semantic distribution in domestic environments causes a spurious association bias, while dynamic perspective changes give rise to a semantic conflict bias. Biased visual perception significantly limits the generalization of the navigation strategy. In this paper, we propose the Unbiased Embodied Visual Representation (UEVR), which overcomes these perception biases using causal inference and cross-modality alignment. Specifically, we establish reasonable assumptions about confounders for multi-object features through our proposed Unbiased Causal R-CNN framework and eliminate the spurious association bias through the Back-door Intervention Causal Adjustment (BICA) module during navigation. To overcome the dynamic-view bias hidden in 2D image features, we propose a cross-modality alignment mechanism with Geometric Constraints (GeoCon) that encodes a 3D geometry prior into the 2D representations. Finally, we design a modular ObjectNav framework integrated with UEVR, named Causal-ObjectNav, which consists of a corner-based scene exploration module and a target object discrimination module. Extensive experiments on the MP3D and HM3D datasets demonstrate the superiority of the unbiased navigation model over existing ObjectNav methods.
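
For readers unfamiliar with back-door adjustment, the general idea behind removing a spurious association can be illustrated with a small discrete example. This is only a sketch of the textbook formula P(Y | do(X)) = sum over z of P(Y | X, z) P(z), not the paper's BICA module. The variables (Z as a hidden scene context, X as an observed object cue, Y as the target being present) and all probabilities are hypothetical values chosen for illustration.

```python
# Toy back-door adjustment with one binary confounder Z.
# All names and numbers are illustrative assumptions, not from the paper.

def backdoor_adjusted(p_z, p_y1_given_xz, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y1_given_xz[x][z] * p_z[z] for z in range(len(p_z)))

def observational(p_z, p_x1_given_z, p_y1_given_xz, x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z | X=x)."""
    # P(X=x | Z=z) for the requested value of x
    p_x_given_z = [p if x == 1 else 1.0 - p for p in p_x1_given_z]
    # Joint P(X=x, Z=z), then normalise to get P(Z=z | X=x) via Bayes' rule
    joint = [p_x_given_z[z] * p_z[z] for z in range(len(p_z))]
    total = sum(joint)
    return sum(p_y1_given_xz[x][z] * joint[z] / total for z in range(len(p_z)))

# Hypothetical scene context Z: equally likely to be 0 or 1
P_Z = [0.5, 0.5]
# The cue X appears mostly when Z=1: P(X=1 | Z=z)
P_X1_GIVEN_Z = [0.1, 0.9]
# The target Y depends only on Z, not on X: P(Y=1 | X=x, Z=z), indexed [x][z]
P_Y1_GIVEN_XZ = [[0.2, 0.8], [0.2, 0.8]]

if __name__ == "__main__":
    obs = observational(P_Z, P_X1_GIVEN_Z, P_Y1_GIVEN_XZ, 1)
    adj = backdoor_adjusted(P_Z, P_Y1_GIVEN_XZ, 1)
    print(round(obs, 2))  # approximately 0.74: X looks predictive of Y
    print(round(adj, 2))  # approximately 0.5: the association vanishes
```

In this toy setting the cue X has no causal effect on Y, yet the observational estimate suggests a strong link because both depend on Z. Averaging over the confounder's marginal distribution, rather than its distribution conditioned on X, removes that spurious association.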
