Abstract
Auditory attention decoding (AAD) aims to detect the target speaker from electroencephalography (EEG) signals in multi‐talker environments. Existing methods often fail to exploit spatial and temporal information jointly, which limits decoding performance. This paper presents STHANet (spatiotemporal hybrid attention network), a dual‐branch model that integrates depth‐wise spatial filtering, log–variance temporal characterization, and transformer‐based spatiotemporal fusion. Experiments on the KUL and DTU datasets show that STHANet achieves competitive performance with a 1‐s decision window, reaching accuracies of 93.6% (KUL) and 75.8% (DTU) under within‐trial partitioning and 76.7% (KUL) and 66.1% (DTU) under strict cross‐trial partitioning. Further evaluation on the AV–GC–AAD dataset under moving‐target gaze‐incongruent conditions shows that the evaluated direct AAD models fall to near‐chance performance when gaze‐related shortcuts are more strictly controlled, and that only STHANet remains significantly above chance. These findings support the effectiveness of STHANet for spatiotemporal EEG feature extraction and highlight the importance of controlling both data‐partitioning bias and gaze‐related confounds in direct AAD.
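The difference between the two partitioning schemes reported above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the 20% held-out fraction, and the toy layout of 4 trials with 10 one-second windows each are assumptions made only for illustration. Within-trial partitioning places windows from the same trial in both the training and test sets, whereas cross-trial partitioning holds out entire trials.

```python
from collections import defaultdict


def within_trial_split(trial_ids, test_fraction=0.2):
    """Hold out the last windows of every trial (hypothetical helper).

    Train and test windows come from the same trials, so trial-specific
    signal properties can leak into the test set.
    """
    by_trial = defaultdict(list)
    for i, t in enumerate(trial_ids):
        by_trial[t].append(i)
    train, test = [], []
    for idx in by_trial.values():
        n_test = max(1, round(len(idx) * test_fraction))
        train.extend(idx[:-n_test])
        test.extend(idx[-n_test:])
    return train, test


def cross_trial_split(trial_ids, test_trials):
    """Hold out whole trials (hypothetical helper).

    No trial contributes windows to both sets.
    """
    held_out = set(test_trials)
    train = [i for i, t in enumerate(trial_ids) if t not in held_out]
    test = [i for i, t in enumerate(trial_ids) if t in held_out]
    return train, test


# Toy layout (assumed): 4 trials, 10 decision windows per trial.
trial_ids = [t for t in range(4) for _ in range(10)]

wt_train, wt_test = within_trial_split(trial_ids)
ct_train, ct_test = cross_trial_split(trial_ids, test_trials=[3])

# Trials that appear on both sides of each split.
wt_shared = {trial_ids[i] for i in wt_train} & {trial_ids[i] for i in wt_test}
ct_shared = {trial_ids[i] for i in ct_train} & {trial_ids[i] for i in ct_test}
print("within-trial shared trials:", sorted(wt_shared))
print("cross-trial shared trials:", sorted(ct_shared))
```

In this toy setup every trial is shared between the training and test sets under the within-trial split, and none is shared under the cross-trial split. This is one plausible reading of why the stricter scheme yields lower but less biased accuracies.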
Xu et al. (Fri,) studied this question.