ABSTRACT
Weapon detection in surveillance imagery is critical for public safety in high-risk environments. Single-stage object detectors such as YOLOv8 localize objects quickly, but their semantic reliability often diminishes when they must distinguish lethal weapons from visually similar non-threat objects in complex scenes. To address this limitation, this work proposes a hybrid weapon detection framework that decouples object localization from semantic verification. The system uses a YOLOv8n detector for real-time region localization, followed by an attention-enhanced classification stage that incorporates a Squeeze-and-Excitation (SE) mechanism. A novel conditional decision logic manages the consensus between the two stages and includes a confidence-weighted fail-safe to maintain high recall for unambiguous threats. We evaluated three backbone architectures (NASNetMobile, DenseNet201, and InceptionV3) under identical constraints. Experimental results on a primary test set show that the hybrid NASNetMobile + SE model achieves 96% accuracy, compared with 94% for the standalone YOLOv8n. More critically, on a challenging ablation dataset, the hybrid framework improved pistol precision from 90% to 95%. Furthermore, integrating the SE block reduced inference latency relative to the vanilla backbone. The proposed system represents a "Pareto-optimal" solution for edge-deployed security infrastructure, prioritizing high-precision firearm recognition without compromising operational latency.
Bithi et al. (Wed,) studied this question.
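The abstract does not spell out the conditional decision logic, so the following is only a minimal sketch of how a two-stage consensus rule with a confidence-weighted fail-safe might look. The threshold values, the `Detection` and `Verification` types, and the `decide` function are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical thresholds: the abstract does not report the values used.
FAILSAFE_CONF = 0.85  # detector confidence at which a detection is kept outright
VERIFY_CONF = 0.50    # minimum classifier confidence needed to confirm a detection


@dataclass
class Detection:
    """Output of the localization stage (e.g. a YOLOv8n box)."""
    label: str
    det_conf: float  # detector confidence in [0, 1]


@dataclass
class Verification:
    """Output of the classification stage run on the cropped region."""
    label: str
    cls_conf: float  # classifier confidence in [0, 1]


def decide(det: Detection, ver: Verification) -> Tuple[Optional[str], str]:
    """Combine both stages and return (final_label_or_None, reason)."""
    # Fail-safe: a very confident detection is kept even if the classifier
    # disagrees, so unambiguous threats are not filtered out (protects recall).
    if det.det_conf >= FAILSAFE_CONF:
        return det.label, "failsafe"
    # Consensus: otherwise the classifier must agree with sufficient confidence.
    if ver.label == det.label and ver.cls_conf >= VERIFY_CONF:
        return det.label, "consensus"
    # No agreement: treat the region as a likely false positive (protects precision).
    return None, "rejected"


if __name__ == "__main__":
    print(decide(Detection("pistol", 0.92), Verification("non_threat", 0.70)))
    print(decide(Detection("pistol", 0.60), Verification("pistol", 0.80)))
    print(decide(Detection("pistol", 0.60), Verification("non_threat", 0.90)))
```

Under these assumptions, the verification stage can only veto detections whose detector confidence falls below the fail-safe threshold, which is one way to trade a precision gain on ambiguous regions against recall on clear threats.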