What does this research mean for the field?

Combining response-based, spatiotemporal attention, and intermediate feature alignment distillation in a hybrid knowledge distillation framework improves the accuracy of lightweight 3D CNNs for video action recognition while substantially reducing computational cost. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to enhance the performance and efficiency of lightweight 3D CNN architectures in video action recognition.

June 7, 2026 · Open Access

Hybrid Knowledge Distillation for Edge-Efficient Video Action Recognition: Improving Lightweight 3D CNNs via Joint Distillation

Key Points

The aim is to enhance the performance and efficiency of lightweight 3D CNN architectures in video action recognition.
Designed three lightweight architectures: Lite-R21D, Lite-MC3, and Lite-LF.
Employed hybrid knowledge distillation combining response-based methods, spatiotemporal attention, and intermediate feature alignment.
Conducted experiments on UCF101 and HMDB51 datasets to analyze knowledge transfer and performance.
The best hybrid setup achieved 92.07% accuracy on UCF101 and 65.56% on HMDB51, narrowing the gap to the Teacher (94.74% and 69.48%) to 2.67 and 3.92 percentage points, respectively.
The proposed models operate with up to 87% fewer parameters and an 89% reduction in floating-point operations (FLOPs).
Achieved 6.7× faster inference compared to traditional models.
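
The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below uses only figures reported in the paper's abstract (Teacher accuracy of 94.74% on UCF101 and 69.48% on HMDB51, plus the stated parameter and FLOP reductions). The helper names are hypothetical and introduced purely for illustration.

```python
# Accuracy figures (in percent) as reported in the paper's abstract.
teacher = {"UCF101": 94.74, "HMDB51": 69.48}
student = {"UCF101": 92.07, "HMDB51": 65.56}


def accuracy_gap(dataset):
    """Teacher minus Student accuracy, in percentage points."""
    return round(teacher[dataset] - student[dataset], 2)


def retained_fraction(reduction_percent):
    """Fraction of the original cost left after a given percent reduction."""
    return 1.0 - reduction_percent / 100.0


def reduction_factor(reduction_percent):
    """How many times smaller the cost becomes (e.g. 50% fewer -> 2x)."""
    return 1.0 / retained_fraction(reduction_percent)


for name in teacher:
    print(f"{name}: gap = {accuracy_gap(name)} percentage points")

print(f"Parameters: up to {reduction_factor(87):.1f}x fewer")
print(f"FLOPs: about {reduction_factor(89):.1f}x fewer")
```

The gaps come out as 2.67 and 3.92 percentage points, matching the reported values. An 89% FLOP reduction corresponds to roughly 9.1× fewer operations, which is larger than the reported 6.7× inference speedup. That is not a contradiction: measured latency usually improves less than the raw FLOP count suggests, because memory traffic and fixed overheads do not shrink in proportion.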

Abstract

One of the remaining challenges in deploying 3D CNN models in resource-constrained environments is their high computational demand. In this paper, we design three lightweight architectures with distinct spatiotemporal topologies, namely Lite-R21D, Lite-MC3, and Lite-LF, to reduce computational cost. However, these compact models have restricted representational capacity, which limits their ability to capture complex spatiotemporal features. To overcome this, we employ Knowledge Distillation (KD) and further investigate hybrid combinations of response-based, spatiotemporal attention, and intermediate feature alignment paradigms. By analyzing knowledge transfer across these diverse architectures, our experiments on UCF101 and HMDB51 demonstrate that combining these distillation configurations consistently outperforms single KD methods, resulting in a substantial increase in accuracy across all Student models. Our optimal hybrid setup achieves 92.07% accuracy on UCF101 and 65.56% on HMDB51, compared to the Teacher’s 94.74% and 69.48%, reducing the accuracy gap to only 2.67 and 3.92 percentage points, respectively. These gains are achieved alongside significant efficiency improvements. The proposed models operate with up to 87% fewer parameters and an 89% reduction in Floating-Point Operations (FLOPs), achieving 6.7× faster inference. Our findings highlight that hybrid distillation is an effective approach for transferring and utilizing complex spatiotemporal knowledge in lightweight models.
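
To make the idea of a hybrid distillation objective concrete, here is a minimal, framework-free sketch of how the three signals named in the abstract could be combined into one loss. It is not the paper's formulation: the abstract does not give the exact loss terms or their weights, so the temperature, the weights `alpha`, `beta`, and `gamma`, the projection matrix, and all function names are assumptions chosen for illustration. Feature maps are assumed to be flattened to C × N lists (N = T·H·W positions), with Student and Teacher sharing the same N.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def response_kd(student_logits, teacher_logits, temperature=4.0):
    """Response-based term: KL(teacher || student) on softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # conventional T^2 scaling of the soft-target term


def attention_map(feat):
    """Sum squared activations over channels, then L2-normalise across positions."""
    n = len(feat[0])
    energy = [sum(channel[i] ** 2 for channel in feat) for i in range(n)]
    norm = math.sqrt(sum(e * e for e in energy)) or 1.0
    return [e / norm for e in energy]


def attention_loss(student_feat, teacher_feat):
    """Attention term: mean squared difference between normalised attention maps."""
    a_s, a_t = attention_map(student_feat), attention_map(teacher_feat)
    return sum((s - t) ** 2 for s, t in zip(a_s, a_t)) / len(a_s)


def project(feat, weight):
    """Pointwise linear map from C_s to C_t channels (the role of a 1x1x1 convolution)."""
    n = len(feat[0])
    return [[sum(w * channel[i] for w, channel in zip(row, feat)) for i in range(n)]
            for row in weight]


def feature_loss(student_feat, teacher_feat, weight):
    """Feature-alignment term: MSE between projected Student and Teacher features."""
    projected = project(student_feat, weight)
    sq = [(s - t) ** 2
          for row_s, row_t in zip(projected, teacher_feat)
          for s, t in zip(row_s, row_t)]
    return sum(sq) / len(sq)


def hybrid_kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                   weight, alpha=1.0, beta=1.0, gamma=1.0, temperature=4.0):
    """Weighted sum of the three distillation terms (weights are hypothetical)."""
    return (alpha * response_kd(student_logits, teacher_logits, temperature)
            + beta * attention_loss(student_feat, teacher_feat)
            + gamma * feature_loss(student_feat, teacher_feat, weight))


# Toy example: 3 classes, Teacher with 2 channels, Student with 1 channel, 3 positions.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]
teacher_feat = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
student_feat = [[0.8, 0.1, 1.7]]
proj_weight = [[1.0], [0.5]]  # C_t x C_s; learned in practice, fixed here

loss = hybrid_kd_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, proj_weight)
print(f"hybrid distillation loss: {loss:.4f}")
```

In a real training loop these terms would typically be added to a supervised cross-entropy loss on the ground-truth labels, and the projection would be a learned layer trained jointly with the Student. Both are omitted here to keep the sketch short.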
