Rapid urbanization has increased traffic congestion and safety risks, creating demand for reliable unmanned aerial vehicle (UAV)-based traffic monitoring systems that can operate under low illumination, dense traffic, and frequent occlusions. However, most existing approaches treat detection, tracking, classification, and trajectory prediction as isolated tasks, which leads to fragmented pipelines, error propagation, and limited robustness in real-world environments. This work develops a unified end-to-end perception framework that performs multi-vehicle detection, tracking, counting, classification, and trajectory forecasting within a single scalable architecture. The proposed framework integrates diffusion-based image restoration for low-quality UAV imagery, Mask2Former for multi-level segmentation, YOLOv9-HG for high-recall detection, MOTRv3 for end-to-end tracking, STCNet for density-aware counting, and STGFormer for spatio-temporal trajectory prediction, supported by Swin Transformer V2 and vector-quantized feature compression for efficient representation learning. Experiments on the UAVID and VAID benchmarks, run on an NVIDIA RTX 4090 GPU, evaluate precision, recall, F1-score, and tracking consistency across multiple runs. The framework achieves up to 98% detection precision, 97% tracking accuracy, and 95% classification performance, consistently outperforming existing state-of-the-art baselines. These results indicate improved robustness under occlusion, scale variation, and illumination changes. The unified design enables reliable, real-time UAV-based traffic perception, making it suitable for intelligent transportation systems, congestion analysis, and smart city deployment.
Mujtaba et al. (Fri,) studied this question.