Deep learning has significantly advanced object detection and scene understanding, yet analyzing uncrewed aerial vehicle (UAV)-based hyperspectral imagery remains challenging due to its high spectral complexity and cluttered scenes. Most existing approaches address detection and segmentation separately or rely on limited data modalities, which constrains their effectiveness in complex aerial scenarios. To address this gap, we propose AeroResNet-Vision (ARV), a novel multimodal fusion framework for UAV imagery. ARV integrates state-of-the-art techniques into a unified pipeline: a ResNet-based DeepLabv3++ module for multi-scale semantic segmentation; the classical Binary Robust Invariant Scalable Keypoints (BRISK) and Maximally Stable Extremal Regions (MSER) detectors for keypoint and region feature extraction; You Only Look Once, version 8 (YOLOv8) for efficient object detection; and a Vision Transformer (ViT) for context-aware classification. Additionally, image normalization and edge detection are applied as preprocessing steps to enhance image quality and emphasize structural features. By fusing handcrafted and deep features from these components, ARV effectively handles varying object scales and complex backgrounds in aerial data. Experimental results on three benchmark datasets demonstrate state-of-the-art accuracy: ARV achieved 97.20%, 98.50%, and 97.60% on ISPRS Potsdam, VEDAI, and UAVid, respectively. These findings validate the framework’s superior performance in UAV image analysis. In conclusion, the proposed multimodal approach provides a robust solution for aerial object recognition, and we recommend exploring lightweight model variants and self-supervised learning to further enhance its deployment potential.
Abdelhaq et al. (Tue,) studied this question.
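The preprocessing and fusion steps named in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not specify which normalization, edge detector, or fusion operator ARV uses, so min-max scaling, a Sobel filter, and concatenation of L2-normalized vectors are assumptions chosen for illustration, and all function names here are hypothetical rather than taken from the authors' implementation.

```python
import math

# Standard 3x3 Sobel kernels (assumed edge detector; the abstract names none).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def min_max_normalize(img):
    """Scale a 2D grayscale image (list of rows) to the range [0, 1]."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    span = hi - lo
    if span == 0:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in img]
    return [[(v - lo) / span for v in row] for row in img]


def sobel_magnitude(img):
    """Gradient magnitude at interior pixels; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out


def l2_normalize(vec):
    """Rescale a feature vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_features(handcrafted, deep):
    """Fuse by concatenating L2-normalized vectors (assumed fusion operator)."""
    return l2_normalize(handcrafted) + l2_normalize(deep)
```

In this sketch, `handcrafted` stands in for a descriptor derived from keypoint or region detectors such as BRISK and MSER, and `deep` for an embedding from a network such as a ViT. Normalizing each vector before concatenation keeps one feature family from dominating the other purely because of scale.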