August 17, 2025 · Open Access

Towards More Generalisable Compositional Feature Learning in Human‐Object Interaction Detection

Key Points

More generalisable feature learning improves detection of rare HOI classes.
Experiments on two benchmark datasets show improved performance on both rare and non-rare classes.
The proposed approach uses a transformer-based model to capture global, image-wide context.
The vision-language model CLIP is used in a computation-efficient manner to improve re-composition sampling and guide interaction feature learning.

Abstract

The long‐tailed distribution of training samples is a fundamental challenge in human‐object interaction (HOI) detection, leading to extremely imbalanced performance on non‐rare and rare classes. Existing works adopt the idea of compositional learning, in which object and action features are learnt individually and re‐composed into new samples of rare HOI classes. However, most of these methods are built on traditional CNN‐based frameworks, which are weak at capturing image‐wide context. Moreover, their simple feature integration mechanisms fail to aggregate effective semantics in the re‐composed features. As a result, these methods achieve only limited improvements in knowledge generalisation. We propose a novel transformer‐based compositional learning framework for HOI detection. Human‐object pair features and interaction features containing rich global context are extracted and comprehensively integrated via a cross‐attention mechanism, generating re‐composed features with more generalisable semantics. To further improve the re‐composed features and promote knowledge generalisation, we leverage the vision‐language model CLIP in a computation‐efficient manner to improve re‐composition sampling and guide interaction feature learning. Experiments on two benchmark datasets demonstrate the effectiveness of our method in improving performance on both rare and non‐rare HOI classes.
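
To make the cross‐attention integration step concrete, the following is a minimal sketch of generic single‐head scaled dot‐product cross‐attention, in which human‐object pair features act as queries and interaction features act as keys and values. It is not the paper's implementation: the function name `cross_attention`, the feature dimension, the numbers of pair and interaction features, and the random projection matrices are all illustrative assumptions, and the abstract does not specify which feature set plays which role.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(pair_feats, inter_feats, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical helper).

    pair_feats:  (P, d) human-object pair features, used as queries.
    inter_feats: (I, d) interaction features, used as keys and values.
    Returns the (P, d) attended features and the (P, I) attention weights.
    """
    q = pair_feats @ w_q
    k = inter_feats @ w_k
    v = inter_feats @ w_v
    # Scaled dot-product similarity between every pair and every interaction.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy inputs: all sizes and values below are assumptions for illustration.
rng = np.random.default_rng(0)
d = 8
pair_feats = rng.normal(size=(3, d))    # 3 human-object pairs
inter_feats = rng.normal(size=(5, d))   # 5 interaction features
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

recomposed, weights = cross_attention(pair_feats, inter_feats, w_q, w_k, w_v)
```

Each row of `weights` sums to one, so every re‐composed feature is a convex combination of the projected interaction features. A simple concatenation or sum of the two feature sets has no such learned, pair‐specific weighting.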
