What does this research mean for the field?

Selecting task-specific image representations (local features for instance detection and Transformer-extracted context for interaction detection) improves the performance of human-object interaction detection compared to shared-representation methods. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to improve human-object interaction detection by tailoring image representations for different sub-tasks.

June 14, 2026

SARS: Selected Appropriate Representation for Sub-tasks in Human-Object Interaction Detection

Key Points

Developed a new human-object interaction detection framework with distinct representations for instance detection and interaction classification.
Used local feature learning for instance detection and a Transformer architecture for interaction detection to extract context information.
Predicted offsets from human and object to interaction points during the matching stage for accurate pairing.
The authors report that the proposed method outperforms existing algorithms.
Significant improvements in interaction detection metrics were observed, though specific numerical values are not provided.
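
To make the instance-detection step more concrete, the sketch below shows one common way center points can be read off a predicted heatmap: keep every cell that scores above a threshold and is a local maximum in its 3x3 neighbourhood. This is an illustrative assumption, not the paper's confirmed procedure; the function name `find_center_peaks`, the 3x3 window, and the threshold value are all hypothetical.

```python
def find_center_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for heatmap cells that look like instance centers.

    Assumption (not from the source): a center is any cell whose score is at
    least `threshold` and not lower than any of its 3x3 neighbours.
    Note that cells on a flat plateau would all be reported.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the neighbours that exist inside the heatmap bounds.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    return peaks
```

Because only a small neighbourhood is inspected, this step depends on local evidence alone, which matches the summary's point that instance detection does not need a large receptive field.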

Abstract

Human-object interaction (HOI) detection methods usually adopt a multi-task framework consisting of a feature-learning backbone and two sub-tasks: instance detection and interaction classification. These two sub-tasks typically share the same image representation. However, different sub-tasks require different image information: the instance detector focuses on features of local regions, while the interaction predictor requires features from a larger receptive field. To address this mismatch, a new HOI detection framework is designed to select the appropriate representation for each sub-task. Specifically, local features are learned to predict the center points of instances. For interaction detection, a Transformer is used to extract context information and improve accuracy. In the human-object pair matching stage, the offsets from the human and the object to the interaction point are predicted to obtain more accurate interaction pairs. Experimental results show that the proposed method achieves better performance than existing algorithms.
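
The matching stage described in the abstract can be illustrated with a small sketch. It assumes each detected human and object comes with a 2D center and a predicted offset to the interaction point, and that a human and an object belong together when their projected interaction points nearly coincide. The pairing rule (nearest projected point within `max_dist`), the function names, and the default threshold are assumptions made for illustration, not the authors' confirmed algorithm.

```python
import math


def project(center, offset):
    """Shift a center by its predicted offset to estimate the interaction point."""
    return (center[0] + offset[0], center[1] + offset[1])


def match_pairs(humans, objects, max_dist=10.0):
    """Pair each human with the object whose projected interaction point is closest.

    humans, objects: lists of (center, offset), each a 2D tuple.
    Returns a list of (human_index, object_index, distance).
    Assumption (hypothetical rule): a pair is accepted only if the two
    projected interaction points lie within `max_dist` of each other.
    """
    pairs = []
    for hi, (h_center, h_offset) in enumerate(humans):
        h_point = project(h_center, h_offset)
        best = None
        for oi, (o_center, o_offset) in enumerate(objects):
            o_point = project(o_center, o_offset)
            dist = math.dist(h_point, o_point)
            if dist <= max_dist and (best is None or dist < best[1]):
                best = (oi, dist)
        if best is not None:
            pairs.append((hi, best[0], best[1]))
    return pairs
```

In this sketch a human whose projected point is far from every object's projected point is left unpaired, which is one simple way to avoid forcing spurious interactions.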
