ABSTRACT The long‐tailed distribution of training samples is a fundamental challenge in human‐object interaction (HOI) detection, leading to highly imbalanced performance between non‐rare and rare classes. Existing works adopt compositional learning, in which object and action features are learnt individually and re‐composed into new samples of rare HOI classes. However, most of these methods are built on traditional CNN‐based frameworks, which are weak at capturing image‐wide context. Moreover, their simple feature integration mechanisms fail to aggregate effective semantics in the re‐composed features. As a result, these methods achieve only limited improvements in knowledge generalisation. We propose a novel transformer‐based compositional learning framework for HOI detection. Human‐object pair features and interaction features containing rich global context are extracted and comprehensively integrated via a cross‐attention mechanism, generating re‐composed features with more generalisable semantics. To further improve the re‐composed features and promote knowledge generalisation, we leverage the vision‐language model CLIP in a computation‐efficient manner to improve re‐composition sampling and guide interaction feature learning. Experiments on two benchmark datasets demonstrate the effectiveness of our method in improving performance on both rare and non‐rare HOI classes.
Liang et al. studied this question.