Digital biomarkers derived from eye tracking and facial expressions hold significant potential for non-invasive screening of cognitive decline (CD). However, existing approaches rely predominantly on single-task designs or on unimodal methods built on feature engineering, which struggle to capture complex temporal behavioral patterns. Although deep learning (DL) excels at extracting hierarchical features and intricate temporal dynamics from behavioral sequences, its application to this multimodal sensing domain remains exploratory. To address this gap, we designed an assessment system that integrates five multi-dimensional cognitive paradigms and collected eye-tracking and facial expression data from 20 healthy controls (HC) and 20 individuals with CD. For these multimodal sequences, we propose a deep neural network capable of multi-scale representation learning. Using subspace exploration and multi-scale convolutions, the architecture extracts deep representations directly from the data, and a decision fusion mechanism enhances diagnostic robustness. Experimental results show that our method achieves 90% classification accuracy, outperforming the machine learning models used for comparison. Furthermore, our statistical analyses validated several features associated with CD and explored potentially novel behavioral patterns. This study confirms the feasibility of a DL framework based on eye-tracking and facial expression signals for identifying CD and provides a reference for developing objective and efficient digital screening tools.
Xue et al. (Thu,) studied this question.