Vehicle detection is a core task in smart city perception management and provides important technical support for sustainable urban development and intelligent transportation optimization. In high-resolution unmanned aerial vehicle (UAV) remote sensing images, however, detection is hampered by variable target scales, severe occlusion, and the difficulty of modeling long-range dependencies. To address these issues, this study proposes MCViM-YOLO, an algorithm that integrates the local perception advantage of convolution with the global modeling capability of the Mamba state space model. Building on YOLOv12, the algorithm reconstructs the neck network. It introduces the Mix-Mamba module, which runs multi-scale convolution in parallel with a selective state space model to capture local details and global spatial dependencies simultaneously; adopts the dual-factor calibration fusion module (DCFM) to adaptively fuse heterogeneous features; and employs a dual-branch attention detection head (DADH) to improve prediction on difficult samples such as occluded and small-scale vehicles. Experiments on the VEBAI dataset show that the proposed model achieves an mAP@0.5 of 92.391% and a recall of 86.070% at a computational cost of 10.41 GFLOPs. These results indicate that the proposed method improves both the accuracy and the efficiency of vehicle detection in complex remote sensing scenarios. The method provides technical support for traffic flow monitoring, low-carbon urban planning, and other sustainable applications, and it offers an innovative paradigm for the deep integration of CNNs and state space models, with both theoretical research value and engineering application prospects.
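The idea of pairing a local multi-scale convolution branch with a global selective state space branch can be illustrated with a toy one-dimensional sketch. This is not the authors' Mix-Mamba module: the function names, the box-filter kernels, the sigmoid gate, and the fixed mixing weight `alpha` are all assumptions made for illustration, and the gated recurrence is only a simplified stand-in for a selective state space scan.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation; output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def multi_scale_local(signal, kernel_sizes=(3, 5)):
    """Local branch (assumed form): average of box filters at several kernel sizes."""
    branches = [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_scan(signal, w_gate=1.0):
    """Global branch (simplified): h_t = (1 - g_t) * h_{t-1} + g_t * x_t,
    where the gate g_t = sigmoid(w_gate * x_t) depends on the input itself."""
    h = 0.0
    out = []
    for x in signal:
        g = sigmoid(w_gate * x)
        h = (1.0 - g) * h + g * x
        out.append(h)
    return out


def mix_block(signal, alpha=0.5):
    """Run both branches in parallel and blend them with a fixed weight (hypothetical)."""
    local = multi_scale_local(signal)
    global_ = selective_scan(signal)
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local, global_)]


if __name__ == "__main__":
    impulse = [2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print("local :", multi_scale_local(impulse))
    print("global:", selective_scan(impulse))
    print("mixed :", mix_block(impulse))
```

For an impulse at the first position, the local branch is exactly zero once the largest kernel no longer overlaps the impulse, while the recurrent state still carries a decaying trace of it. This is the kind of long-range dependency that a convolution-only neck cannot capture without stacking many layers.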
Zhang et al. studied this question.
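The reported metrics both depend on an intersection-over-union (IoU) threshold: in mAP@0.5, a prediction counts as correct only if its IoU with a ground-truth box is at least 0.5. The sketch below shows IoU and a recall computation at that threshold. The box format `(x1, y1, x2, y2)`, the greedy one-to-one matching, and the function names are assumptions for illustration; this is not the evaluation code used in the study, and it does not compute mAP itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Recall = matched ground truths / all ground truths.

    Predictions are assumed to be sorted by descending confidence; each one is
    greedily matched to the unmatched ground-truth box with the highest IoU.
    """
    if not ground_truths:
        return 0.0
    matched = set()
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            value = iou(pred, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
    return len(matched) / len(ground_truths)


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(1, 1, 10, 10), (50, 50, 60, 60)]
    print(recall_at_iou(preds, gts))  # one of two vehicles is found
```

In this example the first prediction overlaps its ground-truth box with an IoU of 0.81 and is counted, while the second prediction matches nothing, so one of the two vehicles is recalled.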