Human-object interaction (HOI) detection methods usually adopt a multi-task framework consisting of a feature-learning backbone and two sub-tasks: instance detection and interaction classification. The two sub-tasks share the same image representation, yet they require different image information: the instance detector focuses on features of local regions, while the interaction predictor requires features with a larger receptive field. To solve this problem, a new HOI detection framework is designed to select an appropriate representation for each sub-task. Specifically, local features are learned to predict the center points of instances. For interaction detection, a Transformer is used to extract context information and improve accuracy. In the human-object pair matching stage, the offsets from the human and the object to the interaction point are predicted to obtain more accurate interaction pairs. Finally, experimental results show that the proposed method achieves better performance than existing algorithms.
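The matching stage described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each detected instance center carries a predicted offset pointing to its interaction point, and that a human and an object are paired when both predicted points fall within a distance threshold of the same detected interaction point. The function name `match_pairs`, the threshold `max_dist`, and the data layout are all hypothetical.

```python
import math


def match_pairs(humans, objects, interaction_points, max_dist=5.0):
    """Pair humans and objects through detected interaction points.

    humans / objects: lists of (center, offset) with (x, y) tuples, where
    center + offset is that instance's predicted interaction point
    (assumed layout, not taken from the paper).
    interaction_points: list of detected (x, y) interaction points.
    Returns a list of (human_idx, object_idx, point_idx) triples.
    """

    def predicted(instance):
        (cx, cy), (dx, dy) = instance
        return (cx + dx, cy + dy)

    def nearest(point, instances):
        # Index of the instance whose predicted interaction point is
        # closest to `point`, or None if none lies within max_dist.
        best_idx, best_dist = None, max_dist
        for i, instance in enumerate(instances):
            d = math.dist(point, predicted(instance))
            if d <= best_dist:
                best_idx, best_dist = i, d
        return best_idx

    pairs = []
    for k, point in enumerate(interaction_points):
        h = nearest(point, humans)
        o = nearest(point, objects)
        if h is not None and o is not None:
            pairs.append((h, o, k))
    return pairs
```

An interaction point with no human or no object within the threshold is simply dropped, which is one plausible way to filter spurious interaction detections. The abstract does not say how such cases are handled.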
Zhang et al. (Fri,) studied this question.