Cat behavior recognition in unconstrained videos is important for animal welfare monitoring and veterinary assessment, yet it remains challenging because behavioral cues are often carried by highly deformable, intermittently visible parts such as the head and tail. This study aims to improve clip-level cat behavior recognition in real-world videos where part visibility is unstable. We propose PMTNet, a part-centric temporal network designed for this setting. The framework first detects the cat body, head, and tail using a DEIM-based detector, then selects a detector according to video-domain continuity and stability, and finally models behavior from ROI appearance features and explicit geometric motion cues. The framework was developed and evaluated on a part-detection dataset of 4000 training images and 500 validation images, together with a cat behavior dataset of 1283 video clips across five categories. In its best-performing setting, PMTNet achieved 93.1% Top-1 accuracy and 90.9% Macro-F1. Ablation studies further suggest that detector choice in the video domain, complementary part cues, and missing-aware fusion all contribute to the final recognition performance. On the present dataset, PMTNet also outperformed representative end-to-end video recognition baselines. These results support the use of part-centric temporal modeling for cat behavior recognition in unconstrained real-world videos.
Tu et al. studied this question.
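To make the ideas of geometric motion cues and missing-aware fusion more concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the PMTNet implementation: the function names (`box_centers`, `motion_cues`, `masked_mean`, `continuity`), the `[x1, y1, x2, y2]` box format, and the use of a simple masked average as the fusion step are all assumptions made for illustration only.

```python
import numpy as np

def box_centers(boxes):
    """boxes: (T, 4) array of [x1, y1, x2, y2] (assumed format). Returns (T, 2) centers."""
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                     (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)

def motion_cues(boxes, visible):
    """Frame-to-frame center displacement of one part (hypothetical geometric cue).

    A displacement is valid only if the part is visible in both frames;
    invalid steps are set to zero and flagged in the returned mask.
    """
    centers = box_centers(boxes)
    valid = visible[1:] & visible[:-1]
    disp = np.where(valid[:, None], centers[1:] - centers[:-1], 0.0)
    return disp, valid

def masked_mean(features, mask):
    """Average features over valid steps only; return zeros if none are valid."""
    if not mask.any():
        return np.zeros(features.shape[1], dtype=float)
    return features[mask].mean(axis=0)

def continuity(visible):
    """Fraction of frames in which the part was detected (assumed continuity score)."""
    return float(np.mean(visible))

# Toy clip: a tail box tracked over 4 frames, missing in frame 2.
tail_boxes = np.array([[0, 0, 2, 2],
                       [1, 0, 3, 2],
                       [0, 0, 0, 0],   # placeholder for a missed detection
                       [3, 0, 5, 2]], dtype=float)
tail_visible = np.array([True, True, False, True])

disp, valid = motion_cues(tail_boxes, tail_visible)
clip_motion = masked_mean(disp, valid)      # ignores steps touching the missing frame
tail_continuity = continuity(tail_visible)
```

In this toy clip only the first step has the tail visible in both frames, so the pooled motion reflects that step alone rather than being distorted by the placeholder box. A learned model would presumably replace the masked average with a trainable fusion module; the sketch only shows why an explicit visibility mask matters.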