What question did this study set out to answer?

This work aims to create an end-to-end perception framework for traffic monitoring using UAVs that unifies several key detection and tracking tasks.

May 28, 2026 | Open Access

A unified deep learning-driven UAV vehicle detection using robust feature extraction for intelligent traffic monitoring system

Key Points

Developed a multi-task framework that performs vehicle detection, tracking, counting, classification, and trajectory forecasting within a single architecture.
Evaluated on the UAVID and VAID benchmarks using an NVIDIA RTX 4090 GPU, measuring precision, recall, F1-score, and tracking consistency across multiple runs.
Incorporated diffusion-based image restoration for low-quality UAV imagery, together with Mask2Former, YOLOv9-HG, MOTRv3, STCNet, and STGFormer, for real-time processing.
Achieved up to 98% detection precision, 97% tracking accuracy, and 95% classification performance.
Demonstrated improved robustness under occlusion, scale variation, and low illumination conditions.
Consistently outperformed existing state-of-the-art methods across multiple performance metrics.

Abstract

Rapid urbanization has increased traffic congestion and safety risks, creating a demand for reliable unmanned aerial vehicle (UAV)-based traffic monitoring systems capable of operating under low illumination, dense traffic, and frequent occlusions. However, most existing approaches address detection, tracking, classification, and trajectory prediction as isolated tasks, leading to fragmented pipelines, error propagation, and limited robustness in real-world environments. This work aims to develop a unified end-to-end perception framework that simultaneously performs multi-vehicle detection, tracking, counting, classification, and trajectory forecasting within a single scalable architecture. The proposed framework integrates diffusion-based image restoration for low-quality UAV imagery, Mask2Former for multi-level segmentation, YOLOv9-HG for high-recall detection, MOTRv3 for end-to-end tracking, STCNet for density-aware counting, and STGFormer for spatio-temporal trajectory prediction, supported by Swin Transformer V2 and vector-quantized feature compression for efficient representation learning. Experiments were conducted on the UAVID and VAID benchmarks using an NVIDIA RTX 4090 GPU, evaluating precision, recall, F1-score, and tracking consistency across multiple runs. The framework achieves up to 98% detection precision, 97% tracking accuracy, and 95% classification performance, consistently outperforming existing state-of-the-art baselines. These results demonstrate improved robustness under occlusion, scale variation, and illumination changes. The proposed unified design enables reliable, real-time UAV-based traffic perception, making it suitable for intelligent transportation systems, congestion analysis, and smart city deployment.
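The abstract reports precision, recall, and F1-score but does not restate how they are defined. As a reminder of the standard definitions, the following minimal sketch computes all three from raw detection counts. The class name `DetectionCounts`, the function names, and the example counts are illustrative assumptions, not taken from the paper, and the paper's own matching criteria (for example, the IoU threshold used to decide a true positive) are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Hypothetical container for per-dataset detection outcomes."""
    true_positives: int   # predicted boxes matched to a ground-truth vehicle
    false_positives: int  # predicted boxes with no matching ground truth
    false_negatives: int  # ground-truth vehicles that were missed


def precision(c: DetectionCounts) -> float:
    """Fraction of predictions that are correct: TP / (TP + FP)."""
    denom = c.true_positives + c.false_positives
    return c.true_positives / denom if denom else 0.0


def recall(c: DetectionCounts) -> float:
    """Fraction of ground-truth objects that were found: TP / (TP + FN)."""
    denom = c.true_positives + c.false_negatives
    return c.true_positives / denom if denom else 0.0


def f1_score(c: DetectionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts for illustration only; these are not results from the paper.
counts = DetectionCounts(true_positives=98, false_positives=2, false_negatives=4)
print(f"precision={precision(counts):.3f}")
print(f"recall={recall(counts):.3f}")
print(f"f1={f1_score(counts):.3f}")
```

Because F1 is a harmonic mean, a high precision figure alone does not imply a high F1; recall has to be comparably high as well, which is why the two are usually reported together.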
