Unconstrained fall detection is essential for real-world applications, yet it remains underexplored because real-world fall data are scarce and existing methods generalize poorly. To address these challenges, we first introduce HUST-FALL, a fine-grained text-video dataset for unconstrained fall detection that features diverse fall scenarios and rich semantic annotations. Building on this dataset, we propose Action-R1, a lightweight vision-language model that leverages structured textual guidance and reasoning to improve the understanding of fall events. In challenging cross-dataset tests, Action-R1 achieves an average F1 score of 0.827 across three benchmarks, significantly outperforming conventional CNN/RNN-based methods. With only 1/16 of the parameters of MiniCPM-V 2.6, Action-R1 performs competitively against it and surpasses it on UPFall by 116.22%. These results demonstrate that Action-R1 is a lightweight yet powerful solution for unconstrained fall detection in real-world scenarios.
Wu et al. (Thu,) studied this question.