Soccer video analysis has significant application value in sports broadcasting, tactical research, and athlete training, with accurate object detection serving as the key foundation for automated analysis. Soccer object detection typically improves performance through enhanced feature representation and optimized network architectures, but these methods assume that models can automatically learn discriminative features of targets. Through experiments, we reveal the “feature collapse” phenomenon in soccer detection, where features of players from the same team are excessively clustered in high-dimensional space, and soccer ball features degenerate to near background noise. Furthermore, existing methods lack progressive feature evolution mechanisms, resulting in insufficient discriminative capability when handling dense scenes. To address these issues, we propose DeCon-Net, which contains a Decoupled Feature Learning Module (DFLM) and a Hierarchical Contrastive Constraint Module (HCCM). Specifically, DFLM designs dual-stream encoders to extract appearance features and identity features separately, forcing the identity stream to learn truly discriminative representations through mutual exclusivity constraints. HCCM adopts dynamic threshold contrastive learning, adaptively adjusting learning intensity based on feature distances between sample pairs, achieving progressive optimization from coarse to fine granularity. Experimental results demonstrate that DeCon-Net achieves significant performance improvements on the SportsMOT and SoccerNet-Tracking datasets, particularly showing substantial gains in ball detection.
Ouyang et al. (Fri,) studied this question.