Learning 3D human-object interactions (HOI) from 2D images is central to understanding how humans interact with objects in 3D space, and it underpins progress in embodied AI and interaction modeling. Existing 3D HOI learning methods rely on visual features alone and therefore often fail to model fine-grained interactions in complex scenes, leaving human contact, object affordance, and spatial relations ambiguous. To address this, we propose SKE-3DHOI, a semantic-knowledge-enhanced framework that integrates semantic knowledge derived from large multimodal models into visual 3D HOI reasoning. Our method queries large multimodal models with HOI-specific text prompts to generate 3D HOI semantic knowledge tensors, which encode critical HOI semantics and are fused with visual embeddings through cross-attention fusion layers. This fusion explicitly aligns visual patterns with semantic knowledge priors. Extensive experiments show that SKE-3DHOI achieves state-of-the-art performance, significantly outperforming existing methods across all metrics in 3D HOI learning. The framework bridges the gap between geometric plausibility and semantic validity, advancing robust 3D HOI understanding.
Li et al. (Mon,) studied this question.
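The cross-attention fusion mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the SKE-3DHOI implementation: the function name `cross_attention_fuse`, the single attention head, the residual connection, the random projection matrices, and all dimensions are assumptions made for illustration. Visual embeddings act as queries, and semantic knowledge embeddings act as keys and values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual, semantic, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention fusion (illustrative only).

    visual:   N_v x d_v visual token embeddings (queries)
    semantic: N_s x d_s semantic knowledge embeddings (keys and values)
    w_q: d_v x d_k, w_k: d_s x d_k, w_v: d_s x d_v projection matrices
    """
    q = matmul(visual, w_q)
    k = matmul(semantic, w_k)
    v = matmul(semantic, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)  # N_v x N_s similarity scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)  # semantic content gathered per visual token
    # Residual connection (an assumption): keep the visual signal, add semantics.
    fused = [[a + b for a, b in zip(v_row, a_row)]
             for v_row, a_row in zip(visual, attended)]
    return fused, weights


def random_matrix(rows, cols, rng):
    """Random stand-in for learned parameters or extracted features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen arbitrarily: 3 visual tokens (dim 4), 5 semantic tokens (dim 6).
rng = random.Random(0)
visual = random_matrix(3, 4, rng)
semantic = random_matrix(5, 6, rng)
w_q = random_matrix(4, 8, rng)
w_k = random_matrix(6, 8, rng)
w_v = random_matrix(6, 4, rng)

fused, weights = cross_attention_fuse(visual, semantic, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the semantic tokens, so it shows which pieces of semantic knowledge a given visual token draws on. In a trained model the projections would be learned, and the layer would typically use multiple heads and normalization.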