Motion recognition in tennis presents unique challenges due to the sport’s high speed, fine-grained motion distinctions, and dynamic environmental conditions such as varying lighting, complex backgrounds, and multiple camera viewpoints. While modern vision models have made significant strides, they often fall short in real-time performance, robustness across diverse playing styles, and the capture of subtle biomechanical cues critical for effective training feedback. To bridge this gap, we propose a novel enhancement of the self-supervised vision foundation model DINOv2, tailored for tennis action understanding and coaching assistance. Our approach introduces a unified architecture comprising three key components: 1) frame-level feature extraction using a DINOv2 backbone; 2) a lightweight temporal modeling module that captures player-specific dynamics and court context; and 3) task-specific heads for action classification and timing prediction. We further design an optimized encoder-decoder structure that prioritizes kinematically salient information during feature compression, preserving fine-grained discrimination among similar strokes. Additionally, a multi-scale feature fusion strategy operating at multiple frame sampling rates enables the model to jointly reason over short-term execution details and long-term tactical patterns. Comprehensive experiments on a curated tennis dataset demonstrate that our method outperforms current state-of-the-art baselines in action recognition accuracy, hit timing estimation, and stroke-type prediction, particularly under challenging real-world conditions.
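The multi-rate fusion idea can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, so mean pooling followed by concatenation is an assumption here, as are the function names (`sample_frames`, `mean_pool`, `fuse_multi_rate`) and the strides `(1, 2, 4)`. The toy 2-dimensional vectors stand in for per-frame DINOv2 embeddings, which are not computed in this example.

```python
from typing import List, Sequence

Vector = List[float]


def sample_frames(features: Sequence[Vector], stride: int) -> List[Vector]:
    """Keep every `stride`-th frame feature, starting from frame 0."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return [features[i] for i in range(0, len(features), stride)]


def mean_pool(features: Sequence[Vector]) -> Vector:
    """Average a sequence of equal-length vectors over the time axis."""
    if not features:
        raise ValueError("cannot pool an empty sequence")
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def fuse_multi_rate(features: Sequence[Vector],
                    strides: Sequence[int] = (1, 2, 4)) -> Vector:
    """Concatenate one pooled descriptor per sampling rate.

    Assumption: pooling + concatenation stands in for the paper's
    (unspecified) multi-scale fusion strategy.
    """
    fused: Vector = []
    for stride in strides:
        fused.extend(mean_pool(sample_frames(features, stride)))
    return fused


# Toy clip: 8 frames, 2-dim features standing in for DINOv2 embeddings.
clip = [[float(t), float(t % 2)] for t in range(8)]
fused = fuse_multi_rate(clip)
print(len(fused))  # one 2-dim descriptor per stride, 3 strides in total
```

In this sketch, the dense stride covers every frame and so retains short-term execution detail, while the sparser strides summarize the same clip at a coarser temporal resolution. A learned temporal module and task-specific heads would consume the fused vector in place of the simple pooling used here.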
Sun et al. (Mon,) studied this question.