Human-Object Interaction (HOI) detection plays a vital role in scene understanding; it aims to predict HOI triplets in the form of <Human, Action, Object>. Existing methods mainly extract multi-modal features (e.g., appearance, object semantics, human pose) and then fuse them to directly predict HOI triplets. However, most of these methods focus on self-triplet aggregation and ignore potential cross-triplet dependencies, which leads to ambiguous action predictions. In this work, we propose to explore Self- and Cross-Triplet Correlations (SCTC) for HOI detection. Specifically, we regard each triplet proposal as a graph in which Human and Object are nodes and Action is the edge, and use this graph to aggregate self-triplet correlations. We also explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations. In addition, we leverage the CLIP model to help SCTC obtain interaction-aware features through knowledge distillation, which provides useful action clues for HOI detection. Extensive experiments on the HICO-DET and V-COCO datasets verify the effectiveness of the proposed SCTC.
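The graph view described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the aggregation functions, so the element-wise mean used for the self-triplet step and the dot-product attention used for the cross-triplet step are stand-in assumptions, and the names `TripletProposal`, `self_triplet_update`, `cross_triplet_update`, and `refine` are hypothetical. The instance-, semantic-, and layout-level relations and the CLIP distillation are left out.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class TripletProposal:
    """One <Human, Action, Object> proposal viewed as a two-node graph."""
    human: Vector   # node feature for the detected person
    obj: Vector     # node feature for the detected object
    action: Vector  # edge feature linking the two nodes


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]


def self_triplet_update(t: TripletProposal) -> Vector:
    """Refine the action edge from its own two nodes (assumed: plain mean)."""
    return mean([t.human, t.obj, t.action])


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_triplet_update(edges: List[Vector]) -> List[Vector]:
    """Mix each edge with the other edges (assumed: dot-product attention)."""
    updated = []
    for i, query in enumerate(edges):
        others = [e for j, e in enumerate(edges) if j != i]
        if not others:
            # A lone triplet has no cross-triplet context to draw on.
            updated.append(list(query))
            continue
        scores = [sum(q * k for q, k in zip(query, e)) for e in others]
        weights = softmax(scores)
        context = [
            sum(w * e[d] for w, e in zip(weights, others))
            for d in range(len(query))
        ]
        # Average the edge with its attention-weighted context.
        updated.append([(q + c) / 2 for q, c in zip(query, context)])
    return updated


def refine(triplets: List[TripletProposal]) -> List[Vector]:
    """Self-triplet aggregation followed by cross-triplet aggregation."""
    edges = [self_triplet_update(t) for t in triplets]
    return cross_triplet_update(edges)
```

Calling `refine` on a list of proposals returns one refined action feature per triplet; in a full model an action classifier would consume those features, and the learned relation modules from the paper would replace the fixed mean and attention used here.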
Weibo Jiang
Weihong Ren
Jiandong Tian
University of Hong Kong
Harbin Institute of Technology
Shenyang Institute of Automation
Jiang et al. (Sun,) studied this question.
www.synapsesocial.com/papers/68e72962b6db6435876a338c — DOI: https://doi.org/10.1609/aaai.v38i3.28031