Multi-object tracking (MOT) aims to localize multiple targets and maintain their identities over time in unconstrained videos. Despite recent progress in tracking-by-detection and end-to-end Transformer-based approaches, two persistent bottlenecks limit practical robustness: the quality and coverage of candidate detections within each frame, and the reliability of cross-frame association under low frame rates, occlusion, and rapid motion. We present AuxTrack, a MOT framework with two components: an auxiliary detection branch with principled filtering, which expands recall while keeping noise under control, and a spatio-temporal attention-based similarity decoder that integrates spatial layout awareness and temporal memory. The auxiliary branch shares backbone features but is recall-oriented; its candidates are then filtered by a multi-dimensional consistency constraint. The similarity decoder fuses object queries and track queries via spatial attention with relative positional encoding and via temporal attention, yielding stable association scores for Hungarian matching. Together, these components improve candidate coverage and association reliability, resulting in fewer identity switches and more complete tracks in dense, fast-motion scenes. Experiments on SportsMOT, a sports-oriented MOT benchmark, are designed to validate the improvements of our approach in challenging scenarios.
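To make the final association step concrete, the following is a minimal sketch of Hungarian matching applied to a track-by-detection similarity matrix. It is an illustration under stated assumptions, not the AuxTrack implementation: the similarity scores are taken as given (in the framework they would come from the similarity decoder), and the names `hungarian_min_cost`, `associate`, and the gating threshold `min_sim` are hypothetical. The abstract does not state whether matched pairs are gated by a score threshold; that step is an assumption added here.

```python
from typing import List, Tuple

def hungarian_min_cost(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost assignment for an n x m matrix with n <= m.

    Potentials-based O(n^2 * m) Hungarian algorithm; returns (row, col) pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]

def associate(sim: List[List[float]], min_sim: float = 0.5):
    """Match tracks (rows) to detections (columns) by maximizing total similarity.

    `min_sim` is a hypothetical gate: pairs scoring below it are left unmatched.
    Returns (matches, unmatched_tracks, unmatched_detections).
    """
    n_tracks = len(sim)
    n_dets = len(sim[0]) if sim else 0
    if n_tracks == 0 or n_dets == 0:
        return [], list(range(n_tracks)), list(range(n_dets))
    transposed = n_tracks > n_dets  # the solver expects rows <= columns
    rows = [list(col) for col in zip(*sim)] if transposed else sim
    cost = [[-s for s in row] for row in rows]  # maximize similarity = minimize -sim
    pairs = hungarian_min_cost(cost)
    if transposed:
        pairs = [(c, r) for r, c in pairs]
    matches = sorted((t, d) for t, d in pairs if sim[t][d] >= min_sim)
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_t = [t for t in range(n_tracks) if t not in matched_t]
    unmatched_d = [d for d in range(n_dets) if d not in matched_d]
    return matches, unmatched_t, unmatched_d
```

For example, with `sim = [[0.9, 0.8], [0.85, 0.1]]`, a greedy matcher would pair track 0 with detection 0 and leave track 1 with a poor match, whereas `associate(sim)` pairs track 0 with detection 1 and track 1 with detection 0, which has the higher total similarity. In a tracker, unmatched detections would typically seed new tracks and unmatched tracks would be kept alive or terminated; how AuxTrack handles these cases is not specified in this section.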
Jiang et al. studied this question.