Understanding human activities in complex social environments from aerial perspectives is a critical challenge for UAV-based surveillance and autonomous systems. We propose a multi-modal framework that integrates appearance-based and skeletal feature representations for robust social activity recognition. Preprocessing comprises atmospheric correction, DeepLabv3 segmentation, Mask R-CNN detection, and DeepSORT tracking. Feature extraction combines PDE-based shape analysis, distance transforms, and heatmap representations with skeletal features, including information landscape analysis, UMAP manifold projection, and motion signatures. Our novel Feature Correlation and Structure Fusion (FC2FS) methodology integrates these heterogeneous modalities. Spatial relationships are modeled using Relational Graph Convolutional Networks with multi-head attention, while Bidirectional LSTM networks capture temporal dependencies. Maximum Entropy Markov Models enable simultaneous individual and social activity classification. On the Okutama-Action UAV dataset, the framework achieves 83.6% accuracy for individual actions and 91.5% for social activities; on the JRDB-Act robotics dataset, it achieves 85.7% and 93.2%, respectively. This represents a 15.24 percentage point improvement over existing UAV-specific methods and sets new performance benchmarks, with computational efficiency suitable for near real-time deployment. The results have significant implications for surveillance systems, autonomous robotics, and human behavior analysis.
Zahra et al. studied this question.
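
To make one of the appearance features named in the abstract concrete, the following is a minimal sketch of a Euclidean distance transform over a binary silhouette mask. It is an illustration only, not the authors' implementation: the function name `distance_transform`, the brute-force search, and the 5x5 `silhouette` mask are all assumptions introduced here, and the abstract does not say which distance metric or algorithm the framework uses.

```python
import math


def distance_transform(mask):
    """Brute-force Euclidean distance transform of a binary mask.

    Every foreground cell (1) receives its distance to the nearest
    background cell (0); background cells receive 0.0.
    Hypothetical helper for illustration, not taken from the paper.
    """
    rows, cols = len(mask), len(mask[0])
    background = [
        (r, c) for r in range(rows) for c in range(cols) if mask[r][c] == 0
    ]
    if not background:
        raise ValueError("mask needs at least one background cell")

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                # Distance to the closest background cell.
                out[r][c] = min(
                    math.hypot(r - br, c - bc) for br, bc in background
                )
    return out


# Hypothetical silhouette: a 3x3 foreground block inside a 5x5 frame.
silhouette = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
dt = distance_transform(silhouette)
```

Values grow toward the interior of the silhouette, so the resulting map encodes how far each pixel lies from the body boundary. The brute-force search is quadratic in the number of pixels; for real frames a linear-time routine such as SciPy's `scipy.ndimage.distance_transform_edt` would be the more practical choice.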