The proposed multimodal transformer architecture offers a new approach to UAV detection and aerial object recognition. It takes multiple data streams (audio, infrared video, RGB video, and radar) as independent modalities. The features specific to each modality are combined and processed together, then passed to the multimodal transformer for classification. Pooling this complementary information within a single integration framework allows the model to discriminate drone targets from other aerial objects, such as birds, helicopters, and airplanes, under outdoor conditions. The approach is expected to outperform traditional single-modality systems by improving detection accuracy through class balancing and by addressing modality-specific limitations. The proposed model has also been tested in several experiments that evaluate its robustness to missing entries, corrupted data, and synthetic inputs. The results suggest that it has strong potential to serve as a benchmark in UAV detection. This work thus contributes to an emerging body of research on sensor fusion and deep learning, demonstrating the potential of multimodal data in real-world detection problems.
Larrat et al. (Thu,) studied this question.
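
To make the fusion idea concrete, the following is a minimal sketch of one way per-modality features could be fused with self-attention and classified. It is not the authors' implementation. It assumes that each modality has already been encoded into a fixed-length feature vector, that a single attention head with random (untrained) weights is enough for illustration, and that a missing modality is handled by dropping its token. The names `DIM`, `init_params`, and `fuse_and_classify` are hypothetical.

```python
import math
import random

MODALITIES = ("audio", "infrared", "rgb", "radar")
CLASSES = ("drone", "bird", "helicopter", "airplane")
DIM = 8  # hypothetical shared feature size per modality


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Small random weights; a trained model would learn these instead."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(seed=0):
    """Create untrained projection and classifier weights (illustration only)."""
    rng = random.Random(seed)
    return {
        "query": random_matrix(rng, DIM, DIM),
        "key": random_matrix(rng, DIM, DIM),
        "value": random_matrix(rng, DIM, DIM),
        "head": random_matrix(rng, len(CLASSES), DIM),
    }


def fuse_and_classify(features, params):
    """Fuse available modality tokens with self-attention, then classify.

    `features` maps a modality name to a DIM-length vector, or to None
    when that modality is missing.
    """
    # One token per available modality; missing modalities are skipped.
    tokens = [features[m] for m in MODALITIES if features.get(m) is not None]
    if not tokens:
        raise ValueError("at least one modality is required")

    queries = [matvec(params["query"], t) for t in tokens]
    keys = [matvec(params["key"], t) for t in tokens]
    values = [matvec(params["value"], t) for t in tokens]

    # Scaled dot-product attention: each token attends to every token.
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(DIM) for k in keys])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(DIM)]
        )

    # Mean-pool the fused tokens and map them to class probabilities.
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(DIM)]
    probabilities = softmax(matvec(params["head"], pooled))
    return dict(zip(CLASSES, probabilities))
```

Calling `fuse_and_classify` with all four modalities, and again with some set to `None`, returns a probability for each class in both cases. This mirrors the kind of missing-input robustness check described above, although meaningful predictions would of course require trained weights and real encoders.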