Gaze target detection aims to localize the target of a person's gaze. During gaze transitions in video, the absence of accurate temporal variation modeling may lead to errors in gaze target localization. In this work, we propose a Transition-aware Gaze Model (TGM), which analyzes temporal differences to model location variation accurately. The TGM contains four key components: a frame gaze model and three transition-aware modules (path variation, direction variation, and fusion). First, the frame Transformer extracts gaze location and direction features. Second, to analyze feature differences across transition frames, we introduce Temporal Variation Modeling (TVM), guided by a transition-aware loss. TVM analyzes the location features to capture the moving trajectory of targets (defined as path variation), which facilitates the search for target locations near the path. Third, TVM also analyzes the direction features to capture the transition-aware direction area (defined as direction variation), which facilitates the search for target locations within this area. Fourth, since gaze directions dynamically adjust to track gaze targets, path variation and direction variation are inherently aligned with the natural movement of a person's gaze. These two variations are therefore fused into a unified transition-aware feature, which helps cover all potential target locations. To search for accurate target locations, we embed this transition-aware feature into the frame features with cross-attention, which enhances gaze target detection in transition frames. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two datasets, VideoAttentionTarget and VideoCoAtt.
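As a rough illustration of the pipeline described above, the following NumPy sketch takes per-frame location and direction features, forms temporal differences as a simple stand-in for path and direction variation, fuses them, and embeds the result into the frame features with single-head cross-attention. It is a minimal sketch under stated assumptions, not the TGM implementation: the function names, the use of plain consecutive-frame differences in place of the learned TVM, the concatenate-and-project fusion, the residual connection, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_difference(feats):
    # feats: (T, D) per-frame features.
    # Returns (T-1, D) differences between consecutive frames.
    # Assumption: a plain difference stands in for the learned TVM.
    return feats[1:] - feats[:-1]

def fuse_variations(path_var, dir_var, w_fuse):
    # Assumption: fusion is concatenation followed by a linear projection.
    # (T-1, 2D) @ (2D, D) -> (T-1, D)
    return np.concatenate([path_var, dir_var], axis=-1) @ w_fuse

def cross_attention(frame_feats, transition_feats):
    # Queries come from frame features (T, D);
    # keys and values come from transition-aware features (T-1, D).
    d = frame_feats.shape[-1]
    scores = frame_feats @ transition_feats.T / np.sqrt(d)  # (T, T-1)
    weights = softmax(scores, axis=-1)
    # Assumption: a residual connection keeps the original frame features.
    return frame_feats + weights @ transition_feats, weights

rng = np.random.default_rng(0)
T, D = 5, 8                                   # frames, feature width (hypothetical)
loc_feats = rng.normal(size=(T, D))           # stand-in location features
dir_feats = rng.normal(size=(T, D))           # stand-in direction features
w_fuse = rng.normal(size=(2 * D, D)) / np.sqrt(2 * D)

path_var = temporal_difference(loc_feats)     # path variation stand-in
dir_var = temporal_difference(dir_feats)      # direction variation stand-in
transition_feat = fuse_variations(path_var, dir_var, w_fuse)

frame_feats = loc_feats + dir_feats           # stand-in frame features
enhanced, attn = cross_attention(frame_feats, transition_feat)
print(enhanced.shape, attn.shape)
```

Each row of `attn` is a distribution over the `T-1` transition steps, so every frame draws on the fused variation features from the whole clip. In a trained model the projections and the variation modeling would be learned; here they are random or fixed purely to show the data flow.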
Yang et al. (Sat,) studied this question.