What question did this study set out to answer?

The aim is to develop a robust model for recognizing human activities using radar and camera inputs while ensuring user privacy.

February 2, 2026 | Open Access

A Lightweight Radar–Camera Fusion Deep Learning Model for Human Activity Recognition

Jeanie Yoo, Sungmin Woo
Pohang University of Science and Technology

Key Points

Developed a lightweight radar–camera fusion deep learning model
Processed radar data as a Range–Doppler–Time cube
Used a privacy-preserving ultra-low-resolution (4×5-pixel) camera
Employed Transformer-based encoders for both the radar and camera inputs
Constructed a multimodal dataset of synchronized radar and camera sequences covering 15 activity classes
Achieved a classification accuracy of 98.74%
Significantly outperformed the single-radar and single-camera baselines
Required only 11 million floating-point operations (11 MFLOPs) for the entire model
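
The input preparation named in the key points can be illustrated with a minimal sketch. It assumes the Range-Doppler-Time cube is stored as a time-ordered list of range-Doppler maps, and that the camera frames are 4 rows by 5 columns and are flattened after differencing. The abstract states only that radar frames are flattened and that camera difference frames are extracted, so the function names, the toy sizes, and the camera flattening step are illustrative assumptions, not the authors' code.

```python
from typing import List

Frame = List[List[float]]  # one 2-D frame (rows x columns)


def flatten_frame(frame: Frame) -> List[float]:
    """Flatten a 2-D frame row by row into a single token vector."""
    return [value for row in frame for value in row]


def radar_cube_to_tokens(cube: List[Frame]) -> List[List[float]]:
    """Turn a time-ordered list of range-Doppler maps into one token per time step."""
    return [flatten_frame(rd_map) for rd_map in cube]


def difference_frames(frames: List[Frame]) -> List[Frame]:
    """Subtract each frame from its successor; T frames give T - 1 differences."""
    return [
        [[c - p for p, c in zip(prev_row, curr_row)]
         for prev_row, curr_row in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


# Toy data: hypothetical sizes, apart from the 4x5 camera resolution.
radar_cube = [[[float(t + r + d) for d in range(3)] for r in range(2)]
              for t in range(4)]                      # 4 time steps, 2x3 maps
camera = [[[float(t * (r + c)) for c in range(5)] for r in range(4)]
          for t in range(4)]                          # 4 frames of 4x5 pixels

radar_tokens = radar_cube_to_tokens(radar_cube)
# Assumption: difference frames are flattened before temporal encoding.
camera_tokens = [flatten_frame(f) for f in difference_frames(camera)]

print(len(radar_tokens), len(radar_tokens[0]))
print(len(camera_tokens), len(camera_tokens[0]))
```

Each resulting token sequence would then be passed to its own temporal encoder. Differencing removes static background from the camera stream, so only changes between consecutive frames remain.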

Abstract

Human activity recognition in privacy-sensitive indoor environments requires sensing modalities that remain robust under illumination variation and background clutter while preserving user anonymity. To this end, this study proposes a lightweight radar–camera fusion deep learning model that integrates motion signatures from frequency-modulated continuous-wave (FMCW) radar with coarse spatial cues from ultra-low-resolution camera frames. The radar stream is processed as a Range–Doppler–Time cube, in which each frame is flattened and the resulting sequence is encoded by a Transformer-based temporal model to capture fine-grained micro-Doppler patterns. The visual stream uses a privacy-preserving 4×5-pixel camera input, from which a temporal sequence of difference frames is extracted and modeled with a dedicated camera Transformer encoder. The two modality-specific feature vectors, each representing the temporal dynamics of motion, are concatenated and passed through a lightweight fully connected classifier to predict the human activity category. A multimodal dataset of synchronized radar cubes and ultra-low-resolution camera sequences across 15 activity classes was constructed for evaluation. Experimental results show that the proposed fusion model achieves 98.74% classification accuracy, significantly outperforming the single-modality baselines (single-radar and single-camera). Despite this accuracy, the entire model requires only 11 million floating-point operations (11 MFLOPs), making it highly efficient for deployment on embedded or edge devices.
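
The fusion step described in the abstract, concatenating the two feature vectors and classifying them with a fully connected layer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature dimensions, the single-layer classifier, and the random weights are hypothetical, and the two Transformer encoders are replaced by placeholder feature vectors. Only the 15-class output size comes from the abstract.

```python
import math
import random
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(radar_feat: List[float], camera_feat: List[float],
                      weights: List[List[float]], bias: List[float]) -> List[float]:
    """Late fusion: concatenate both feature vectors, then classify."""
    fused = radar_feat + camera_feat  # list concatenation
    return softmax(linear(fused, weights, bias))


NUM_CLASSES = 15   # number of activity classes, from the abstract
RADAR_DIM = 8      # hypothetical radar feature size
CAMERA_DIM = 4     # hypothetical camera feature size

rng = random.Random(0)
# Placeholders for the outputs of the radar and camera Transformer encoders.
radar_feat = [rng.uniform(-1, 1) for _ in range(RADAR_DIM)]
camera_feat = [rng.uniform(-1, 1) for _ in range(CAMERA_DIM)]
# Untrained, randomly initialized classifier parameters (illustration only).
weights = [[rng.uniform(-0.1, 0.1) for _ in range(RADAR_DIM + CAMERA_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(radar_feat, camera_feat, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted_class, round(sum(probs), 6))
```

Concatenation keeps the fusion stage cheap: a single layer over the fused vector costs roughly (RADAR_DIM + CAMERA_DIM) × NUM_CLASSES multiply-accumulate operations, which is small next to the encoders themselves.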
