What question did this study set out to answer?

To develop an efficient and interpretable framework for sign language recognition on resource-constrained devices.

February 6, 2026 · Open Access

An explainable hybrid CNN–transformer model for sign language recognition on edge devices using adaptive fusion and knowledge distillation

Key Points

Designed a hybrid CNN-Transformer model (TinyMSLR) for sign classification.
Implemented an adaptive fusion gate to integrate local and contextual cues.
Applied dual-teacher knowledge distillation to enhance model efficiency and accuracy.
Evaluated the system using two public datasets in a multilingual setting.
Achieved 99.28% training accuracy and 99.01% validation accuracy.
Reached an F1-score of 98.96% on isolated-sign recognition.
Maintained a parameter count under 2.7 million.
Measured an inference latency of 24 ms on standard CPUs and under 13.5 ms on edge GPUs.
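
The adaptive fusion gate named in the key points is not specified in detail on this page. As a rough illustration only, one common way to build such a gate is a per-dimension sigmoid computed from the concatenated features of both streams, which then blends the two vectors. The sketch below assumes that form; the function name `adaptive_fusion`, the weight layout, and the use of plain Python lists are all hypothetical and may differ from what TinyMSLR actually does.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def adaptive_fusion(local_feat, context_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a learned gate.

    Assumed form (not taken from the paper):
        g     = sigmoid(W @ [local; context] + b)
        fused = g * local + (1 - g) * context

    gate_weights has one row per output dimension, each row of
    length 2 * d, where d is the feature dimension.
    """
    if len(local_feat) != len(context_feat):
        raise ValueError("both streams must have the same dimension")

    concat = list(local_feat) + list(context_feat)
    fused = []
    for w_row, b, loc, ctx in zip(gate_weights, gate_bias, local_feat, context_feat):
        # Gate value in (0, 1): how much to trust the local (CNN) stream.
        g = sigmoid(sum(w * x for w, x in zip(w_row, concat)) + b)
        fused.append(g * loc + (1.0 - g) * ctx)
    return fused


if __name__ == "__main__":
    local = [1.0, 2.0]    # stand-in for CNN-branch features
    context = [3.0, 6.0]  # stand-in for Transformer-branch features
    zero_w = [[0.0] * 4, [0.0] * 4]
    # With zero weights and bias the gate is 0.5, so the output is the mean.
    print(adaptive_fusion(local, context, zero_w, [0.0, 0.0]))
```

With a strongly positive bias the gate saturates toward 1 and the output follows the local stream; a strongly negative bias makes it follow the contextual stream. In a trained model the weights would decide this per input.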

Abstract

Despite recent advances in deep learning (DL) for sign language recognition (SLR), most existing systems remain limited to monolingual datasets, lack interpretability, and are too computationally intensive for real-time edge deployment. With the growing need for inclusive and real-time communication technologies, efficient and deployable SLR systems are of critical importance. This paper presents TinyMSLR, an explainable, lightweight framework designed for isolated-sign (gloss) classification on resource-constrained devices. TinyMSLR combines a ConvNeXt-Tiny encoder for fine-grained local visual cues with a Swin Transformer encoder for long-range spatio-temporal context, and integrates an adaptive fusion gate to balance both streams. To further improve accuracy under strict compute and memory budgets, we introduce a dual-teacher knowledge distillation (KD) scheme that transfers complementary spatial and contextual knowledge from high-capacity CNN and Transformer teachers to the compact student model. We evaluate TinyMSLR in a controlled multilingual setting using two public datasets (DGS RWTH-PHOENIX-Weather 2014T and Mandarin CSL) by constructing a shared subset of 20 semantically aligned sign classes and segmenting RWTH continuous sequences into single-gloss clips. Therefore, all reported results correspond to isolated-sign recognition rather than continuous sentence-level multilingual CSLR. On this benchmark, TinyMSLR achieves 99.28% training accuracy and 99.01% validation accuracy, with an F1-score of 98.96%, while keeping the parameter count under 2.7M. Inference latency is 24 ms on standard CPUs and under 13.5 ms on edge GPUs. Overall, TinyMSLR demonstrates a practical accuracy-efficiency-explainability trade-off that is well aligned with deployment-ready multilingual isolated-sign systems on the edge.
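
The abstract does not give the exact distillation objective, so the following is only a sketch of how a dual-teacher KD loss is often written: a hard-label cross-entropy term plus temperature-softened KL terms against each teacher. The weights `alpha` and `beta`, the `temperature` default, and the function name `dual_teacher_kd_loss` are assumptions made for illustration, not values or names from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def dual_teacher_kd_loss(
    student_logits,
    cnn_teacher_logits,
    transformer_teacher_logits,
    label,
    temperature=4.0,  # assumed value
    alpha=0.5,        # assumed weight of the hard-label term
    beta=0.5,         # assumed split between the two teachers
):
    """Assumed objective for one sample:

        L = alpha * CE(student, label)
          + (1 - alpha) * T^2 * (beta * KL(cnn || student)
                                 + (1 - beta) * KL(transformer || student))
    """
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label] + 1e-12)

    # Soft targets from both teachers, compared at the same temperature.
    student_soft = softmax(student_logits, temperature)
    kd_cnn = kl_divergence(softmax(cnn_teacher_logits, temperature), student_soft)
    kd_tr = kl_divergence(
        softmax(transformer_teacher_logits, temperature), student_soft
    )
    kd = beta * kd_cnn + (1.0 - beta) * kd_tr

    # T^2 keeps the soft-target gradient scale comparable across temperatures.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kd


if __name__ == "__main__":
    student = [2.0, 0.0, 0.0]
    print(dual_teacher_kd_loss(student, [3.0, 0.5, 0.0], [2.5, 0.0, 0.5], label=0))
```

When the student already matches both teachers, the two KL terms vanish and only the weighted cross-entropy remains; any disagreement with either teacher adds a positive penalty.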
