The clinical diagnosis of tic disorders involves many complex processes, which require long-term observation and analysis of the patient's behavior. This study aims to help identify tic activities, the typical symptoms of tic disorders. We collect real clinical data to build a tic dataset, which records consultation videos of 80 child patients with tics and covers 13 categories of tic activities. We apply masked image modeling (MIM) and masked video modeling to the video data as a two-stream pretraining strategy. We propose a novel Masked Video Modeling Adaptive Transformer (MAT), which includes a new masking strategy that obtains masks through our dynamic mask sampler. In this way, we aim to increase the difficulty of self-supervised training and improve the robustness of the model. We also design a Residual Adaptive Block (RAB) in the backbone. By introducing 3D local feature learning into the feed-forward network (FFN), it further improves the feature expression ability required for scene understanding. Unlike traditional FFN modules, a residual convolution structure is used to make full use of the feature information extracted by the model for classification. On the tic action dataset, we compare our approach with recent CNN and Transformer action recognition methods. For a fair comparison, we also conduct experiments on Something-Something V2 and Kinetics-400. The experimental results demonstrate the effectiveness of our masked video modeling self-supervised approach for tic action recognition. Our code and abstract are publicly available at https://github.com/hzxie99/MAT.
Wang et al. studied this question.