What does this research mean for the field?

A novel Masked Video Modeling Adaptive Transformer (MAT), which uses a dynamic mask sampler and a Residual Adaptive Block, recognizes and classifies tic actions in children from clinical video data. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This study aims to enhance the identification of tic activities associated with tic disorders in children.

May 16, 2026 | Open Access

A dynamic dual-processing action recognition framework for tic action recognition in children tic disorder

Key Points

This study aims to enhance the identification of tic activities associated with tic disorders in children.
Collected clinical data from videos of 80 children with tic disorders to create a tic action dataset.
Employed masked video modeling and a novel Masked Video Modeling Adaptive Transformer (MAT) for analysis.
Integrated a Residual Adaptive Block (RAB) to improve feature extraction and classification capabilities.
The proposed masked video modeling method significantly improved action recognition accuracy compared to traditional CNN and Transformer approaches.
Demonstrated robust performance on the tic action dataset and other datasets like Something–Something-V2 and Kinetics-400.

Abstract

The clinical diagnosis of tic disorders involves many complex processes, which require long-term observation and analysis of the patient’s behaviors. This study aims to help identify the tic activities that are the typical symptoms of tic disorder. We collect real clinical data to produce a tic dataset. This dataset records consultation videos of 80 child tic patients and contains 13 categories of tic activities. We conduct masked image modeling (MIM) and masked video modeling on video data as a two-stream pretraining strategy. A novel Masked Video Modeling Adaptive Transformer (MAT) is proposed, which contains a new masking strategy that obtains masks through our dynamic mask sampler. In this way, we aim to increase the difficulty of self-supervised training and the robustness of the model. We also design a Residual Adaptive Block (RAB) in the backbone. By introducing 3D local feature learning into the feed-forward network (FFN), it further improves the feature expression ability required for scene understanding. Unlike traditional FFN modules, a residual convolution structure is used to make full use of the feature information extracted by the model for classification. On the tic action dataset, we compare our proposed approach with recent CNN and Transformer action recognition methods. For a fair comparison, we also conduct experiments on Something–Something-V2 and Kinetics-400. The experimental results demonstrate the effectiveness of our masked video modeling self-supervised approach for tic action recognition. Our code and abstract are publicly available at https://github.com/hzxie99/MAT .
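
The masking step in masked video modeling can be illustrated with a minimal sketch. It shows plain uniform random masking over a grid of video patch tokens, which is a generic baseline and not the dynamic mask sampler proposed in the paper, whose details are not given in the abstract. The function names, the 8 x 14 x 14 token grid, and the 0.9 mask ratio are all hypothetical values chosen for illustration.

```python
import random


def sample_token_mask(num_frames, height, width, mask_ratio, seed=None):
    """Return a flat list of booleans (True = masked) over a T*H*W token grid.

    Assumption: uniform random masking, used here only as a stand-in for
    whatever sampling rule a real masked video model applies.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    total = num_frames * height * width
    num_masked = int(round(total * mask_ratio))
    rng = random.Random(seed)
    # Draw the masked positions without replacement.
    masked_indices = set(rng.sample(range(total), num_masked))
    return [i in masked_indices for i in range(total)]


def split_tokens(tokens, mask):
    """Split tokens into (visible, hidden) according to the boolean mask."""
    visible = [t for t, m in zip(tokens, mask) if not m]
    hidden = [t for t, m in zip(tokens, mask) if m]
    return visible, hidden


# Hypothetical grid: 8 temporal slots, 14 x 14 spatial patches, 90% masked.
mask = sample_token_mask(8, 14, 14, mask_ratio=0.9, seed=0)
visible, hidden = split_tokens(list(range(len(mask))), mask)
```

In a setup like this, an encoder would see only the `visible` tokens and a decoder would be trained to reconstruct the `hidden` ones. A higher mask ratio makes the reconstruction task harder, which is the general motivation the abstract gives for its masking strategy.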
