What question did this study set out to answer?

The aim is to improve gaze target detection by modeling temporal variations during gaze transitions in video.

March 16, 2026

Transition-aware Path and Direction Variation Modeling for Gaze Target Detection in Video

Key Points

Developed a Transition-aware Gaze Model (TGM) incorporating multiple modules for gaze analysis.
Implemented a frame gaze model using a Transformer to extract gaze location and direction features.
Introduced Temporal Variation Modeling (TVM) to capture path and direction variations during transition frames.
Employed cross-attention to fuse transition-aware features with frame features for enhanced detection.
Achieved state-of-the-art performance on the VideoAttentionTarget and VideoCoAtt datasets.
Demonstrated improved gaze target localization accuracy during gaze transitions in video.
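
The cross-attention fusion named in the key points can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention in which frame features act as queries and transition-aware features act as keys and values, followed by a residual addition. The function name `cross_attention`, the projection matrices, and all tensor sizes are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(frame_feats, transition_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative assumption, not the paper's code).

    frame_feats:      (N, d) tokens of the current frame, used as queries
    transition_feats: (M, d) transition-aware tokens, used as keys and values
    """
    q = frame_feats @ w_q                      # (N, d)
    k = transition_feats @ w_k                 # (M, d)
    v = transition_feats @ w_v                 # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    fused = frame_feats + weights @ v          # residual connection
    return fused, weights

# Toy sizes (hypothetical): 16 frame tokens, 4 transition tokens, width 8.
rng = np.random.default_rng(0)
d = 8
frame_feats = rng.normal(size=(16, d))
transition_feats = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention(frame_feats, transition_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each frame token receives a weighted mix of the transition-aware tokens, so the fused output keeps the frame feature's shape while carrying transition information.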

Abstract

Gaze target detection aims to localize a person's gaze target. During gaze transition in video, the absence of accurate temporal variation modeling may lead to errors in gaze target localization. In this work, we propose a Transition-aware Gaze Model (TGM), which focuses on analyzing temporal differences to achieve accurate location variation modeling. The TGM contains four key components: a frame gaze model, and three transition-aware modules (path variation, direction variation, and fusion). First, the frame Transformer extracts gaze location and direction features. Second, to analyze the feature difference among transition frames, we introduce Temporal Variation Modeling (TVM) guided by transition-aware loss. TVM analyzes the location features to capture the moving trajectory of targets (defined as path variation), which facilitates the search for target locations near the path. Third, TVM also analyzes the direction features to capture the transition-aware direction area (defined as direction variation), which facilitates the search for target locations within this area. Fourth, since gaze directions dynamically adjust to track gaze targets, path variation and direction variation are inherently aligned with the natural movement of a person's gaze. Thus, these two variations are fused into a unified transition-aware feature, which helps cover all potential target locations. To search for accurate target locations, we embed this transition-aware feature into frame features with cross-attention, which can enhance gaze target detection in transition frames. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two datasets, namely VideoAttentionTarget and VideoCoAtt.
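
The idea of analyzing feature differences among transition frames can be made concrete with a small sketch. The abstract describes TVM as a learned module guided by a transition-aware loss; the code below swaps in plain first-order frame differences and fuses them by concatenation, purely as an assumption for illustration. The function names, the 2D toy features, and the concatenation step are all hypothetical and do not come from the paper.

```python
import numpy as np

def temporal_variation(feats):
    """First-order differences between consecutive frames: (T, d) -> (T-1, d).

    A stand-in for learned temporal variation modeling (assumption).
    """
    return np.diff(feats, axis=0)

def fuse_variations(path_var, dir_var):
    # Concatenate the two variations per frame pair (hypothetical fusion choice).
    return np.concatenate([path_var, dir_var], axis=-1)

# Toy clip of 5 frames. The gaze point is static, moves for two steps, then settles.
location = np.array([
    [0.10, 0.20],
    [0.10, 0.20],
    [0.30, 0.25],
    [0.50, 0.30],
    [0.50, 0.30],
])
# Gaze direction as a unit vector that rotates while the target moves.
angles = np.array([0.0, 0.0, 0.4, 0.8, 0.8])
direction = np.stack([np.cos(angles), np.sin(angles)], axis=1)

path_var = temporal_variation(location)     # how the target location changes
dir_var = temporal_variation(direction)     # how the gaze direction changes
transition_feature = fuse_variations(path_var, dir_var)

# Frame pairs with non-zero variation mark the transition in this toy clip.
magnitude = np.linalg.norm(path_var, axis=1)
print(transition_feature.shape, magnitude > 0)
```

In this toy clip the location and direction change over the same frame pairs, which mirrors the abstract's observation that path variation and direction variation are aligned during a gaze transition.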


Cite This Study

Yang et al. studied this question.

synapsesocial.com/papers/69b79e488166e15b153ab5f4 https://doi.org/10.1145/3799429
