What question did this study set out to answer?

This research aims to improve human activity recognition accuracy using advanced deep learning techniques with wearable sensors.

June 10, 2026 | Open Access

Intuitive Multi‐Scale Visual Feature Fusion for Automated Classification of Human Activity

Key Points

This research aims to improve human activity recognition accuracy using advanced deep learning techniques with wearable sensors.
Transformed raw multivariate sensor signals into structured image representations using spectrogram encoding.
Utilized Inception and Xception deep learning architectures for feature extraction and fusion.
Applied principal component analysis (PCA) for dimensionality reduction and processed features through multi-scale CNN, LSTM, and RNN classifiers.
Achieved an accuracy of 98.88% on the WISDM dataset.
Achieved an accuracy of 98.71% on the UCI-HAR dataset.
Achieved an accuracy of 98.71% on the PAMAP2 dataset.
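The sensor-to-image step listed above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the window length, hop size, 50 Hz sampling rate, log scaling, and the names `spectrogram` and `window_to_image` are all assumptions, and the input is synthetic rather than drawn from WISDM, UCI-HAR or PAMAP2.

```python
import numpy as np

def spectrogram(signal, win_len=64, hop=32):
    """Magnitude spectrogram of a 1-D signal via a short-time Fourier transform."""
    taper = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * taper
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies: win_len // 2 + 1 bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

def window_to_image(window_xyz, win_len=64, hop=32):
    """Stack per-axis log-spectrograms into a (freq, time, channels) image."""
    chans = [np.log1p(spectrogram(window_xyz[:, c], win_len, hop))
             for c in range(window_xyz.shape[1])]
    img = np.stack(chans, axis=-1)
    # Scale to [0, 1] so the array can be treated as image intensities.
    return (img - img.min()) / (img.max() - img.min() + 1e-12)

# Synthetic tri-axial window (assumed 50 Hz sampling, 256 samples).
fs = 50
t = np.arange(256) / fs
rng = np.random.default_rng(0)
window_xyz = np.stack([np.sin(2 * np.pi * 2 * t),
                       np.sin(2 * np.pi * 5 * t),
                       rng.normal(size=t.size)], axis=1)

image = window_to_image(window_xyz)
print(image.shape)
```

Each sensor axis becomes one image channel here, which is only one of several plausible layouts; the source does not state how the channels are arranged.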

Abstract

Human activity recognition (HAR) with wearable sensors has become a major research focus because of its broad applications, such as healthcare monitoring, smart homes and human‐computer interaction. Recognising activities accurately from multivariate sensor data remains difficult, however, because sensor signals are noisy, features can be redundant and temporal dependencies are complex. We propose a deep learning method that addresses these issues by combining sensor‐to‐image conversion, feature‐level fusion, dimensionality reduction and multi‐scale classification. First, raw multivariate sensor signals are transformed into structured image representations using spectrogram‐based encoding, which lets convolutional neural networks capture spatial patterns in temporal data. Two complementary deep architectures, Inception and Xception, then extract features from the generated images. For feature‐level fusion, the feature vectors extracted by the two networks are joined to exploit the complementary information they contain. Principal component analysis (PCA) is then applied to obtain a compact reduced fused feature (RFF) representation, which minimises feature redundancy and computational complexity. The reduced feature space is processed by a shared multi‐scale convolutional front‐end with kernel sizes of 3, 5 and 7, followed by CNN, LSTM and RNN classifiers that model spatial‐temporal activity patterns. In tests on the WISDM, UCI‐HAR and PAMAP2 datasets, the proposed method achieves accuracies of 98.88%, 98.71% and 98.71%, respectively.
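The fusion, reduction and multi‐scale stages described in the abstract can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions, not the published pipeline: random arrays stand in for the Inception and Xception feature vectors, fusion is assumed to be plain concatenation, the feature sizes (128 per network, 32 after PCA) are arbitrary, the convolution filters are random rather than learned, and `pca_reduce` and `multi_scale_frontend` are hypothetical helper names. Only the kernel sizes 3, 5 and 7 come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Stand-ins for per-image features from the two networks (sizes are arbitrary).
feats_a = rng.normal(size=(n_samples, 128))
feats_b = rng.normal(size=(n_samples, 128))

# Feature-level fusion, assumed here to be concatenation along the feature axis.
fused = np.concatenate([feats_a, feats_b], axis=1)  # (200, 256)

def pca_reduce(x, n_components):
    """Project x onto its top principal components using an SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Reduced fused feature (RFF) representation.
rff = pca_reduce(fused, 32)  # (200, 32)

def multi_scale_frontend(x, rng, kernel_sizes=(3, 5, 7)):
    """Apply one random 1-D filter per kernel size, with 'same' padding and ReLU."""
    branches = []
    for k in kernel_sizes:
        kernel = rng.normal(size=k) / k  # a trained network would learn these
        conv = np.stack([np.convolve(row, kernel, mode="same") for row in x])
        branches.append(np.maximum(conv, 0.0))
    return np.stack(branches, axis=-1)  # (n_samples, n_features, n_scales)

multi_scale = multi_scale_frontend(rff, rng)
print(fused.shape, rff.shape, multi_scale.shape)
```

In a trained model the three branches would use learned filters and feed the CNN, LSTM and RNN classifiers. The sketch only shows how the tensor shapes change from fused features to the multi‐scale representation.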
