Traffic surveillance cameras, as core sensing devices in smart cities, are crucial for traffic management, violation detection, and autonomous driving. However, deployment constraints and hardware limitations mean that the videos they capture often suffer from low resolution and noise, which causes object detectors trained on high-resolution data to miss targets and produce false detections. To address this issue, this study proposes a cascaded collaborative framework that integrates video super-resolution (VSR) and object detection for robust perception in low-quality traffic surveillance scenarios. First, a transformer-based VSR model with masked intra- and inter-frame attention (MIA-VSR) reconstructs temporally coherent high-resolution video sequences from the degraded inputs. The super-resolved frames are then used to build a domain-specific dataset on which a lightweight one-stage detector, You Only Look One-level Feature (YOLOF), is trained for efficient vehicle localisation. Extensive experiments on public datasets (REDS, Vimeo90k, and UA-DETRAC) show that the proposed framework achieves 56.89 mAP@0.5 on low-resolution UA-DETRAC, outperforming both direct inference on low-resolution inputs (39.17 mAP@0.5) and conventional fine-tuning (45.70 mAP@0.5) by 17.72 and 11.19 points, respectively. These findings indicate that super-resolution-driven data reconstruction is an effective way to mitigate feature degradation in low-quality surveillance footage, offering both theoretical insight and practical value for intelligent transportation perception systems.
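The headline comparison rests on mAP@0.5: a detection counts as a true positive only if its intersection-over-union (IoU) with an unmatched ground-truth box is at least 0.5, per-class average precision (AP) is the area under the resulting precision-recall curve, and mAP averages AP over classes. The gains of 17.72 and 11.19 are absolute differences on this scale (56.89 - 39.17 and 56.89 - 45.70). The sketch below is a generic PASCAL VOC-style, all-point-interpolated implementation, assuming axis-aligned `(x1, y1, x2, y2)` boxes; the function names are hypothetical, and the official UA-DETRAC toolkit or the authors' evaluation code may differ in details such as ignore regions or interpolation.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed box format
Detection = Tuple[str, float, Box]       # (image_id, confidence, box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(dets: List[Detection], gt: Dict[str, List[Box]],
                      iou_thr: float = 0.5) -> float:
    """VOC-style AP for one class (hypothetical helper, not the paper's code)."""
    n_gt = sum(len(boxes) for boxes in gt.values())
    if n_gt == 0:
        return 0.0
    used = {img: [False] * len(boxes) for img, boxes in gt.items()}
    tp_flags = []
    # Greedily match detections in descending confidence order.
    for img, _score, box in sorted(dets, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gt.get(img, [])):
            overlap = iou(box, g)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr and not used[img][best_j]:
            used[img][best_j] = True
            tp_flags.append(1)
        else:
            tp_flags.append(0)  # low IoU or duplicate hit: false positive
    # Cumulative precision and recall after each ranked detection.
    recall, precision, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        recall.append(tp / n_gt)
        precision.append(tp / k)
    # All-point interpolation: make precision non-increasing, then integrate.
    mrec = [0.0] + recall + [1.0]
    mpre = [0.0] + precision + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


def map_at_50(dets_by_class: Dict[str, List[Detection]],
              gt_by_class: Dict[str, Dict[str, List[Box]]]) -> float:
    """Mean AP at IoU 0.5 over all classes present in the ground truth."""
    aps = [average_precision(dets_by_class.get(c, []), gt_by_class[c])
           for c in gt_by_class]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, with two ground-truth cars in one frame and three ranked detections (a hit, a miss, then a second hit), recall reaches 1.0 but one false positive lowers the interpolated precision, giving an AP of about 0.833, or 83.3 on the percentage scale the abstract reports.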
Yu et al. (Sun,) studied this question.