Human action recognition in close-contact sports is hindered by mutual occlusion, rapid pose changes, and distracting backgrounds. We study freestyle wrestling—a representative close-contact setting with sustained physical interaction—and present the Open FSW dataset of 210 trimmed clips covering seven techniques (30 clips per class), sourced from both controlled training sessions and broadcast footage. We introduce a foreground-aware RGB pipeline that segments athletes with a fine-tuned DeepLabV3+ model, extracts per-frame features using CNN backbones (VGG16, InceptionV3, EfficientNet-B7), and aggregates them with a bidirectional LSTM to produce clip-level predictions. Under a group-aware six-fold cross-validation protocol stratified by match/session ID to reduce train–test contamination across related sequences, the best configuration (DeepLabV3+ (foreground) + EfficientNet-B7 + Bi-LSTM) attains 82.9% top-1 accuracy. Ablation results quantify the added value of foregrounding, showing consistent gains for the strongest backbone and the largest improvements on high-occlusion techniques, at the cost of additional inference latency due to segmentation. Due to the modest dataset size, we mitigate overfitting via transfer learning and extensive augmentation, and we frame conclusions as domain-specific to freestyle wrestling. The dataset and code are released. To comply with copyright constraints, the controlled subset is provided as processed clips, while the broadcast subset is released as annotations and clip metadata to enable reconstruction.
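The group-aware six-fold protocol described above can be illustrated with a minimal sketch. It assumes each clip carries a match/session ID and that every clip sharing an ID must land in the same fold. The function names (`group_kfold`, `iter_splits`), the session labels, and the toy layout of 42 sessions with 5 clips each are hypothetical and are not taken from the Open FSW release. The sketch balances fold sizes greedily and does not balance classes across folds; the authors' actual fold construction may differ.

```python
from collections import defaultdict


def group_kfold(groups, n_splits=6):
    """Return one fold index per clip, keeping each group inside a single fold.

    groups: list of group IDs (e.g. match/session IDs), one entry per clip.
    """
    # Collect the clip indices that belong to each group.
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")

    # Greedy balancing: place the largest groups first,
    # each into the fold that currently holds the fewest clips.
    fold_sizes = [0] * n_splits
    fold_of = [None] * len(groups)
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    for g in ordered:
        target = min(range(n_splits), key=lambda f: fold_sizes[f])
        for idx in members[g]:
            fold_of[idx] = target
        fold_sizes[target] += len(members[g])
    return fold_of


def iter_splits(fold_of, n_splits=6):
    """Yield (train_indices, test_indices) for each held-out fold."""
    for k in range(n_splits):
        train = [i for i, f in enumerate(fold_of) if f != k]
        test = [i for i, f in enumerate(fold_of) if f == k]
        yield train, test


# Toy data (assumption): 210 clips drawn from 42 hypothetical sessions.
groups = [f"session_{i // 5:02d}" for i in range(210)]
fold_of = group_kfold(groups, n_splits=6)

for train, test in iter_splits(fold_of, n_splits=6):
    # No session may appear on both sides of a split.
    assert {groups[i] for i in train}.isdisjoint({groups[i] for i in test})
```

In practice a library splitter such as scikit-learn's `GroupKFold` or `StratifiedGroupKFold` would likely be preferable, since the latter also tries to keep class proportions similar across folds.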
Rostamian et al. (Mon,) studied this question.