The high computational demand of 3D CNN models remains a challenge for their deployment in resource-constrained environments. In this paper, we design three lightweight architectures with distinct spatiotemporal topologies, namely Lite-R21D, Lite-MC3, and Lite-LF, to reduce computational cost. These compact models, however, have restricted representational capacity, which limits their ability to capture complex spatiotemporal features. To overcome this, we employ Knowledge Distillation (KD) and investigate hybrid combinations of response-based, spatiotemporal attention, and intermediate feature alignment paradigms. By analyzing knowledge transfer across these diverse architectures, our experiments on UCF101 and HMDB51 demonstrate that hybrid distillation configurations consistently outperform single-method KD, yielding a substantial increase in accuracy for all Student models. Our best hybrid setup achieves 92.07% accuracy on UCF101 and 65.56% on HMDB51, compared with the Teacher’s 94.74% and 69.48%, narrowing the accuracy gap to only 2.67 and 3.92 percentage points, respectively. These gains are achieved alongside significant efficiency improvements: the proposed models operate with up to 87% fewer parameters and an 89% reduction in Floating-Point Operations (FLOPs), achieving 6.7× faster inference. Our findings highlight that hybrid distillation is an effective approach for transferring and utilizing complex spatiotemporal knowledge in lightweight models.
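To make the idea of a hybrid distillation objective concrete, the following is a minimal sketch of how a response-based term, a spatiotemporal attention term, and an intermediate feature alignment term might be combined into one loss. It is an illustration under stated assumptions, not the formulation used in this work: the function names, the weights `alpha`, `beta`, `gamma`, the temperature, and the choice of KL divergence, squared-activation attention maps, and mean squared error are all hypothetical. Features are represented as plain lists (channels × flattened spatiotemporal positions), and the feature alignment term assumes the Student features have already been projected to the Teacher's shape.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def response_kd(student_logits, teacher_logits, temperature=4.0):
    """Response-based term: KL(teacher || student) on softened outputs.

    The T^2 factor is a common convention; the temperature is an assumption.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def attention_map(features):
    """Collapse channels into one L2-normalized map over positions.

    `features` is a list of channels, each a flat list of activations
    over spatiotemporal positions (T*H*W flattened).
    """
    positions = len(features[0])
    raw = [sum(ch[i] ** 2 for ch in features) for i in range(positions)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def attention_kd(student_feats, teacher_feats):
    """Attention term: mean squared difference between attention maps.

    Channel counts may differ, since channels are summed out.
    """
    a_s = attention_map(student_feats)
    a_t = attention_map(teacher_feats)
    return sum((s - t) ** 2 for s, t in zip(a_s, a_t)) / len(a_s)


def feature_kd(student_feats, teacher_feats):
    """Feature alignment term: MSE between same-shaped feature tensors."""
    diffs = [
        (s - t) ** 2
        for s_ch, t_ch in zip(student_feats, teacher_feats)
        for s, t in zip(s_ch, t_ch)
    ]
    return sum(diffs) / len(diffs)


def hybrid_kd_loss(student_logits, teacher_logits,
                   student_attn_feats, teacher_attn_feats,
                   student_align_feats, teacher_align_feats,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (alpha * response_kd(student_logits, teacher_logits)
            + beta * attention_kd(student_attn_feats, teacher_attn_feats)
            + gamma * feature_kd(student_align_feats, teacher_align_feats))


if __name__ == "__main__":
    teacher_logits = [2.0, 0.5, -1.0]
    student_logits = [1.2, 0.9, -0.3]
    teacher_feats = [[1.0, 2.0, 0.5], [0.5, 0.0, 1.5]]
    student_feats = [[0.8, 1.5, 0.7], [0.2, 0.1, 1.0]]
    loss = hybrid_kd_loss(student_logits, teacher_logits,
                          student_feats, teacher_feats,
                          student_feats, teacher_feats)
    print(f"hybrid KD loss: {loss:.4f}")
```

In practice such a term would typically be added to a supervised classification loss and computed on framework tensors rather than Python lists; the pure-Python form is used here only to keep the sketch self-contained.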
Rasras et al. (Fri,) studied this question.