What question did this study set out to answer?

This study investigates how data splitting strategies affect the performance estimation of machine learning and deep learning models in human activity recognition tasks.

May 10, 2026 | Open Access

How improper dataset split hinders model generalizability: a systematic comparison in Human activity recognition and exercise evaluation tasks

Key Points

This study investigates how data splitting strategies affect the performance estimation of machine learning and deep learning models in human activity recognition tasks.
Experiments used a large-scale HAR benchmark dataset (NTU RGB+D 120) and a rehabilitation-specific dataset (IntelliRehabDS).
A total of 12 machine learning and deep learning models were trained across both tasks, and their performance was estimated and compared using a simulation-based approach.
Predictive variance decomposition, via Generalized Linear Mixed-Effects models, linked the split strategy and training/test differences to model output stability.
Non-cross-subject (NCS) splits consistently overestimated model performance, with discrepancies growing alongside task and model complexity; for deep learning models the gap was generally statistically significant.
Variance decomposition indicated that greater subject differences between training and test sets often increased predictive instability.
Cross-subject (CS) splitting reduced predictive variance by promoting more generalizable representations.

Abstract

BACKGROUND: Human Activity Recognition (HAR) and exercise assessment models are increasingly used in healthcare to support clinical evaluation, rehabilitation, and remote monitoring. However, their real-world applicability critically depends on the ability to generalize to unseen subjects, whose movement patterns may differ substantially due to inter-individual variability. Despite this, many studies adopt random non-cross-subject (NCS) data splits, where samples from the same individual appear in both training and test sets, potentially leading to overly optimistic and clinically misleading performance estimates.
OBJECTIVE: We investigate (i) how NCS and cross-subject (CS) splits affect performance estimation across machine learning and deep learning models under tasks of increasing complexity, and (ii) how data splitting and differences between training and test sets contribute to predictive variance and stability.
METHODS: Experiments were performed using a large-scale HAR benchmark dataset (NTU RGB+D 120) and a rehabilitation-specific dataset (IntelliRehabDS). A total of 12 machine learning and deep learning models were trained across both tasks, and their performance was estimated and compared using a simulation-based approach. Predictive variance decomposition, via Generalized Linear Mixed-Effects models, was applied to link the split strategy and differences in training and test instances to model output stability.
RESULTS: NCS splits consistently overestimated model performance, with discrepancies increasing alongside task and model complexity. Deep learning architectures, in particular, showed markedly higher performance under NCS splits than under CS splits, generally with statistical significance. Variance decomposition revealed that greater subject difference between training and test sets often increases predictive instability, while CS splitting reduces variance by promoting more generalizable representations.
CONCLUSIONS: Improper dataset splits can mislead model evaluation, exaggerate generalization capabilities, and undermine clinical trust. Our study provides empirical evidence for computer vision-based rehabilitation models and offers methodological guidance for robust evaluation practices, supporting reproducible and trustworthy AI deployment in rehabilitation and broader healthcare applications.
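
The difference between the two split strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`ncs_split`, `cs_split`, `shared_subjects`), the toy data, and the 25% test fraction are all assumptions made for illustration. An NCS split shuffles individual samples, so the same subject can end up on both sides, whereas a CS split holds out whole subjects.

```python
import random
from collections import defaultdict


def ncs_split(samples, test_fraction=0.25, seed=0):
    """Random non-cross-subject split: shuffle samples, ignoring subject identity."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def cs_split(samples, test_fraction=0.25, seed=0):
    """Cross-subject split: hold out whole subjects so none appears on both sides."""
    by_subject = defaultdict(list)
    for subject_id, clip_id in samples:
        by_subject[subject_id].append((subject_id, clip_id))
    subjects = sorted(by_subject)
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    train = [s for sid in subjects[n_test:] for s in by_subject[sid]]
    test = [s for sid in subjects[:n_test] for s in by_subject[sid]]
    return train, test


def shared_subjects(train, test):
    """Return the set of subject IDs present in both train and test."""
    return {sid for sid, _ in train} & {sid for sid, _ in test}


# Hypothetical toy data: 8 subjects with 10 clips each, as (subject_id, clip_id).
samples = [(f"S{s:02d}", c) for s in range(8) for c in range(10)]

ncs_train, ncs_test = ncs_split(samples)
cs_train, cs_test = cs_split(samples)

print("NCS subjects in both sets:", len(shared_subjects(ncs_train, ncs_test)))
print("CS subjects in both sets:", len(shared_subjects(cs_train, cs_test)))
```

With a random shuffle, test subjects almost always also appear in the training set, which is the leakage the study attributes inflated NCS estimates to. The CS variant guarantees an empty overlap by construction.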
