Multi-object tracking (MOT) in videos captured by Unmanned Aerial Vehicles (UAVs) is challenged by significant camera ego-motion, frequent occlusions, and complex object interactions. To address the limitations of conventional trackers that depend on static, rule-based association strategies, this paper introduces STC-SORT, a tracking framework built around a two-level reasoning architecture for data association. First, a Spatio-Temporal Consistency Graph Network (STC-GN) models inter-object relationships via graph attention to learn adaptive weights for fusing motion, appearance, and geometric cues. Second, these dynamic weights are integrated into a 4D association cost volume, enabling globally optimal matching across a temporal window. When integrated with an enhanced AEE-YOLO detector, STC-SORT achieves significant and statistically robust improvements on major UAV tracking benchmarks. It raises MOTA by 13.0% on UAVDT and 6.5% on VisDrone, and IDF1 by 9.7% and 9.9%, respectively. The framework also maintains real-time inference speed (75.5 FPS) and substantially reduces identity switches. These results indicate that STC-SORT has strong potential for robust multi-object tracking in challenging UAV scenarios.
Ma et al. studied this question.
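The idea of fusing motion, appearance, and geometric cues with adaptive weights before matching can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how STC-GN produces its weights or how the 4D cost volume is built, so the example below assumes a single frame pair, hypothetical per-pair cue scores (`cue_scores`) standing in for the network output, a softmax to turn them into weights, and an exhaustive search over permutations standing in for the globally optimal matcher. All function names are illustrative.

```python
import math
from itertools import permutations


def softmax(scores):
    """Turn raw cue scores into non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_costs(motion, appearance, geometry, cue_scores):
    """Fuse three cost matrices with per-pair adaptive weights.

    cue_scores[i][j] holds three raw scores (motion, appearance, geometry)
    for track i and detection j. In this sketch they are hand-written;
    they are an assumed stand-in for what a learned model would output.
    """
    n_tracks, n_dets = len(motion), len(motion[0])
    fused = [[0.0] * n_dets for _ in range(n_tracks)]
    for i in range(n_tracks):
        for j in range(n_dets):
            w_m, w_a, w_g = softmax(cue_scores[i][j])
            fused[i][j] = (w_m * motion[i][j]
                           + w_a * appearance[i][j]
                           + w_g * geometry[i][j])
    return fused


def best_assignment(cost):
    """Exhaustive minimum-cost matching for a small square cost matrix.

    Brute force is used only to keep the sketch dependency-free; it is
    impractical beyond a handful of objects.
    """
    n = len(cost)
    best_perm, best_total = None, math.inf
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Toy data: motion and geometry agree on the identity matching,
# while appearance alone would suggest swapping the two objects.
motion = [[0.1, 0.9], [0.8, 0.2]]
appearance = [[0.9, 0.1], [0.2, 0.8]]
geometry = [[0.2, 0.7], [0.6, 0.3]]

# Hypothetical scores that favor the motion cue for every pair.
motion_favoring = [[[2.0, 0.0, 1.0] for _ in range(2)] for _ in range(2)]
# Hypothetical scores that favor the appearance cue for every pair.
appearance_favoring = [[[0.0, 5.0, 0.0] for _ in range(2)] for _ in range(2)]

assignment, total_cost = best_assignment(
    fuse_costs(motion, appearance, geometry, motion_favoring))
swapped, _ = best_assignment(
    fuse_costs(motion, appearance, geometry, appearance_favoring))
```

In this toy setup, `assignment[i]` is the detection index matched to track `i`. Weighting motion heavily selects the identity matching, whereas weighting appearance heavily flips it, which is the behavior that fixed, rule-based weights cannot adapt to. A practical tracker would replace the brute-force search with a polynomial-time assignment solver and extend the cost over a temporal window.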