What question did this study set out to answer?

The study set out to improve tennis motion recognition with a specialized vision model, and thereby to improve training feedback.

June 10, 2026 | Open Access

Vision foundation model based intelligent method for tennis movement recognition and auxiliary training

Key Points

Proposed a self-supervised vision model enhancement based on DINOv2 for tennis.
Developed a unified architecture integrating frame-level feature extraction, temporal modeling, and task-specific classification heads.
Conducted experiments on a curated tennis dataset to validate performance against state-of-the-art models.
Achieved superior action recognition accuracy compared to state-of-the-art methods under real-world conditions.
Improved hit timing estimation and stroke-type prediction relative to the same state-of-the-art baselines.
Demonstrated effective model robustness across diverse playing styles and challenging environments.

Abstract

Motion recognition in tennis presents unique challenges due to the sport’s high speed, fine-grained motion distinctions and dynamic environmental conditions such as varying lighting, complex backgrounds and multi-angle viewpoints. While modern vision models have made significant strides, they often fall short in real-time performance, robustness across diverse playing styles, and capturing subtle biomechanical cues critical for effective training feedback. To bridge this gap, we propose a novel enhancement of the self-supervised vision foundation model DINOv2, specifically tailored for tennis action understanding and auxiliary coaching. Our approach introduces a unified architecture comprising three key components: 1) frame-level feature extraction using a DINOv2 backbone; 2) a lightweight temporal modeling module that captures player-specific dynamics and court context; 3) task-specific heads for precise action classification and timing prediction. We further design an optimized encoder/decoder structure that prioritizes kinematically salient information during feature compression, ensuring fine-grained discrimination among similar strokes. Additionally, a multi-scale feature fusion strategy operating at multiple frame sampling rates enables the model to jointly reason over short-term execution details and long-term tactical patterns. Comprehensive experiments on a curated tennis dataset demonstrate that our method outperforms current state-of-the-art baselines in action recognition accuracy, hit timing estimation and stroke-type prediction—particularly under challenging real-world conditions.
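The multi-scale fusion idea in the abstract, reasoning over the same clip at several frame sampling rates, can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the strides (1, 4, 16), the helper names, the use of mean pooling per rate, and fusion by simple concatenation are all hypothetical, and the per-frame features stand in for whatever a DINOv2 backbone would produce.

```python
from typing import Dict, List, Sequence


def sample_indices(num_frames: int, stride: int) -> List[int]:
    """Return every `stride`-th frame index, starting from frame 0."""
    if num_frames < 0 or stride < 1:
        raise ValueError("num_frames must be >= 0 and stride must be >= 1")
    return list(range(0, num_frames, stride))


def multi_rate_views(
    num_frames: int, strides: Sequence[int] = (1, 4, 16)
) -> Dict[int, List[int]]:
    """Build one list of frame indices per sampling stride (hypothetical strides)."""
    return {s: sample_indices(num_frames, s) for s in strides}


def mean_pool(features: Sequence[Sequence[float]], indices: Sequence[int]) -> List[float]:
    """Average the feature vectors of the selected frames, dimension by dimension."""
    dim = len(features[0])
    return [sum(features[i][d] for i in indices) / len(indices) for d in range(dim)]


def fuse(
    features: Sequence[Sequence[float]], strides: Sequence[int] = (1, 4, 16)
) -> List[float]:
    """Concatenate one pooled vector per stride (an assumed, very simple fusion rule).

    `features` is a list of per-frame feature vectors, e.g. stand-ins for
    backbone embeddings. A small stride keeps short-term execution detail;
    a large stride summarizes the longer-term pattern of the clip.
    """
    if not features:
        raise ValueError("features must contain at least one frame")
    views = multi_rate_views(len(features), strides)
    fused: List[float] = []
    for s in strides:
        fused.extend(mean_pool(features, views[s]))
    return fused
```

For a clip of 8 frames and strides (1, 4), the dense view covers frames 0 through 7 and the sparse view covers frames 0 and 4; the fused vector is the two pooled vectors placed side by side. A learned temporal module would normally replace the mean pooling used here.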
