In high-frame-rate human–computer interaction and mobile-perception scenarios, single-frame human action recognition must meet stringent latency and accuracy constraints. To address spatial feature entanglement, multiscale fragmentation, and inefficient edge deployment, this study proposes YOLO11-AN (Action Net), a lightweight detector that combines a C3K2-DMAF dynamic multiscale fusion block, a dual-branch AUX head, an MPDIoU regression loss, and a LocalWindowAttention module. Comprehensive evaluations on Pascal VOC 2012, UCF101, and HMDB51 show that YOLO11-AN attains 0.537 mAP@50 on VOC, an absolute gain of 1.7 percentage points over the YOLO11 baseline, while keeping inter-seed variance below 0.001. Against peer-reviewed baselines (YOLOv8-n, PP-YOLOE-Tiny, and RT-DETR-R18), it offers the best accuracy–compute tradeoff; after INT8 quantization it sustains 15.8 FPS on a 4 GB Jetson Orin Nano, supporting its suitability for real-time, low-power deployment.
Ding et al. studied this question.
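To make the MPDIoU regression loss named above more concrete, the following is a minimal sketch of the commonly published MPDIoU formulation: IoU minus the squared distances between the two boxes' top-left and bottom-right corners, each normalized by the squared diagonal of the input image. It is not taken from the YOLO11-AN implementation. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` box layout, and the `eps` stabilizer are assumptions made for illustration.

```python
def mpdiou_loss(pred, target, img_w, img_h, eps=1e-9):
    """Illustrative MPDIoU loss for one pair of axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2;
    img_w and img_h are the input image width and height.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area, clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU (eps avoids division by zero).
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distances between matching corners.
    d_top_left = (px1 - tx1) ** 2 + (py1 - ty1) ** 2
    d_bottom_right = (px2 - tx2) ** 2 + (py2 - ty2) ** 2

    # Normalize by the squared image diagonal.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d_top_left / norm - d_bottom_right / norm

    return 1.0 - mpdiou


if __name__ == "__main__":
    same = mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640)
    shifted = mpdiou_loss((20, 20, 60, 60), (10, 10, 50, 50), 640, 640)
    print(f"identical boxes: {same:.6f}, shifted box: {shifted:.6f}")
```

Unlike plain IoU, the corner-distance terms still vary when the boxes do not overlap, so the loss continues to distinguish near misses from distant ones. In a training pipeline the same arithmetic would be written over batched tensors.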