Human activity recognition in privacy-sensitive indoor environments requires sensing modalities that remain robust under illumination variation and background clutter while preserving user anonymity. To this end, this study proposes a lightweight radar–camera fusion deep learning model that integrates motion signatures from FMCW radar with coarse spatial cues from ultra-low-resolution camera frames. The radar stream is processed as a Range–Doppler–Time cube, where each frame is flattened and sequentially encoded using a Transformer-based temporal model to capture fine-grained micro-Doppler patterns. The visual stream employs a privacy-preserving 4×5-pixel camera input, from which a temporal sequence of difference frames is extracted and modeled with a dedicated camera Transformer encoder. The two modality-specific feature vectors—each representing the temporal dynamics of motion—are concatenated and passed through a lightweight fully connected classifier to predict human activity categories. A multimodal dataset of synchronized radar cubes and ultra-low-resolution camera sequences across 15 activity classes was constructed for evaluation. Experimental results show that the proposed fusion model achieves 98.74% classification accuracy, significantly outperforming single-modality baselines (single-radar and single-camera). Despite its performance, the entire model requires only 11 million floating-point operations (11 MFLOPs), making it highly efficient for deployment on embedded or edge devices.
Building similarity graph...
Analyzing shared references across papers
Loading...
Jeanie Yoo
Sungmin Woo
Sensors
Korea University of Technology and Education
Building similarity graph...
Analyzing shared references across papers
Loading...
Yoo et al. (Thu,) studied this question.
www.synapsesocial.com/papers/6980fd60c1c9540dea80f108 — DOI: https://doi.org/10.3390/s26030894
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: