This paper studies the integration of a deep motion model into Simple Online and Realtime Tracking (SORT) for multi-object tracking in video. The tracking-by-detection paradigm faces significant challenges in handling nonlinear motion and occlusions. Although conventional Kalman-filter-based methods such as SORT are efficient, they suffer from error accumulation because of their linear motion assumption. We propose KalmanFormer, a novel framework that enhances Kalman-filter-based tracking through adaptive motion modeling for video sequences. KalmanFormer consists of three key components. First, the inner-trajectory motion corrector, built upon the transformer architecture, refines Kalman filter predictions by learning nonlinear residuals from historical trajectories, thereby improving adaptability to complex motion patterns in videos. Second, the cross-trajectory attention module captures inter-object motion correlations, significantly boosting object association under occlusions. Third, a pseudo-observation generator provides neural-network-based predictions when detections are missing, stabilizing the Kalman filter update process. Comprehensive evaluations on the DanceTrack, MOT17, and MOT20 video benchmarks demonstrate the effectiveness of KalmanFormer in handling complex motion and occlusion. It achieves competitive performance, with HOTA scores of 66.6 on MOT17 and 63.2 on MOT20, and strong identity preservation, with IDF1 scores of 82.0% and 80.1%, respectively.
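The interplay between the Kalman filter, the residual corrector, and the pseudo-observation can be illustrated with a minimal sketch. It is not the authors' implementation: the state is reduced to a 1-D position with constant velocity, and the names `Track1D`, `step`, and `toy_residual` are hypothetical. In particular, `toy_residual` is a hand-written stand-in for the transformer-based corrector, and the cross-trajectory attention module is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Track1D:
    """Constant-velocity Kalman filter over a 1-D position (illustrative only)."""
    x: float            # position estimate
    v: float            # velocity estimate
    p_xx: float = 1.0   # entries of the symmetric 2x2 state covariance
    p_xv: float = 0.0
    p_vv: float = 1.0
    q: float = 0.01     # process noise added to the covariance diagonal
    r: float = 0.1      # measurement noise variance

    def predict(self) -> float:
        # x' = x + v (dt = 1); P' = F P F^T + Q with F = [[1, 1], [0, 1]]
        self.x += self.v
        self.p_xx += 2.0 * self.p_xv + self.p_vv + self.q
        self.p_xv += self.p_vv
        self.p_vv += self.q
        return self.x

    def update(self, z: float) -> None:
        # Standard Kalman update with H = [1, 0] (only position is observed)
        s = self.p_xx + self.r
        k_x, k_v = self.p_xx / s, self.p_xv / s
        innovation = z - self.x
        self.x += k_x * innovation
        self.v += k_v * innovation
        # P = (I - K H) P, written out entry by entry using the old values
        self.p_vv -= k_v * self.p_xv
        self.p_xv *= 1.0 - k_x
        self.p_xx *= 1.0 - k_x


def toy_residual(history: List[float]) -> float:
    """Hypothetical stand-in for the learned corrector: half the last
    second difference, i.e. a crude constant-acceleration correction."""
    if len(history) < 3:
        return 0.0
    return 0.5 * (history[-1] - 2.0 * history[-2] + history[-3])


def step(track: Track1D,
         detection: Optional[float],
         history: List[float],
         corrector: Callable[[List[float]], float]) -> float:
    """One tracking step; returns the refined prediction used for association."""
    prior = track.predict()               # linear Kalman prediction
    refined = prior + corrector(history)  # add the nonlinear residual
    # With no detection, the refined prediction serves as a pseudo-observation
    z = detection if detection is not None else refined
    track.update(z)
    history.append(track.x)
    return refined
```

Calling `step(track, None, history, toy_residual)` on a frame with a missed detection still runs the update, so the state is pulled toward the corrected prediction instead of coasting on the linear model alone. In the framework described above, the corrector would be a trained network operating on bounding-box trajectories rather than this scalar heuristic.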
Hong et al. studied this question.