What does this research mean for the field?

A Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework effectively fuses audio, video, and physiological features, achieving a concordance correlation coefficient of 0.747 for arousal and 0.609 for valence. Novelty: methodological. Consensus alignment: neutral.

October 13, 2015

Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks

Q: What is the clinical evidence from this study?

Study design: Other. Population: Affective dimension prediction. Intervention: Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework vs. Baseline features. Primary outcome: Concordance correlation coefficient (CCC) for arousal and valence.

Key Result

A Deep Bidirectional Long Short-Term Memory Recurrent Neural Network based multimodal affect prediction framework achieved a concordance correlation coefficient of 0.747 for arousal and 0.609 for valence on the test set.

Structured PICO

Population

Datasets from the Audio-Visual Emotion Challenge (AV+EC 2015)

Intervention

Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework using audio, video, and physiological (ECG, EDA) features

Comparator

Baseline features and standard correlation methods

Outcome

Concordance correlation coefficient (CCC) for arousal and valence (surrogate)
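
For reference, the concordance correlation coefficient in its commonly used form (Lin's formulation) can be written as below. The symbols are notation introduced here for illustration, with x as the predictions and y as the gold-standard labels; they are not taken from the paper.

```latex
\rho_c = \frac{2\,\rho\,\sigma_x\,\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
```

Here \rho is the Pearson correlation coefficient between x and y, \mu and \sigma^2 are the respective means and variances. Unlike the Pearson coefficient alone, this measure also penalizes differences in mean and scale, so it reaches 1 only when predictions and labels coincide.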

A DBLSTM-RNN based multimodal framework using audio, video, and physiological features achieved promising results for predicting affective dimensions.

Abstract

This paper presents our system design for the Audio-Visual Emotion Challenge (AV+EC 2015). Besides the baseline features, we extract from audio the functionals on low-level descriptors (LLDs) obtained via the YAAFE toolbox, and from video the Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) features. From the physiological signals, we extract 52 electro-cardiogram (ECG) features and 22 electro-dermal activity (EDA) features from various analysis domains. The extracted features along with the AV+EC 2015 baseline features of audio, ECG or EDA are concatenated for a further feature selection step, in which the concordance correlation coefficient (CCC), instead of the usual Pearson correlation coefficient (CC), has been used as objective function. In addition, offsets between the features and the arousal/valence labels are considered in both feature selection and modeling of the affective dimensions. For the fusion of multimodal features, we propose a Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework, in which the initial predictions from the single modalities via the DBLSTM-RNNs are firstly smoothed with Gaussian smoothing, then input into a second layer of DBLSTM-RNN for the final prediction of affective state. Experimental results show that our proposed features and the DBLSTM-RNN based fusion framework obtain very promising results. On the development set, the obtained CCC is up to 0.824 for arousal and 0.688 for valence, and on the test set, the CCC is 0.747 for arousal and 0.609 for valence.
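
Three ingredients named in the abstract (CCC as an objective, feature/label offsets, and Gaussian smoothing of initial predictions) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `ccc`, `best_offset`, and `gaussian_smooth` are hypothetical, the assumption that labels trail the features by a whole number of frames is ours, and the kernel radius and edge handling are arbitrary choices made for the example.

```python
import math


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    # Two identical constant sequences give a zero denominator; return 0.0 then.
    return 2 * cov / denom if denom else 0.0


def best_offset(feature, labels, max_shift):
    """Find the label delay (in frames) that maximizes CCC with one feature.

    Assumption: labels lag the feature, so feature[t] pairs with labels[t + shift].
    Returns a (shift, score) tuple.
    """
    best = (0, ccc(feature, labels))
    for shift in range(1, max_shift + 1):
        score = ccc(feature[:-shift], labels[shift:])
        if score > best[1]:
            best = (shift, score)
    return best


def gaussian_smooth(signal, sigma):
    """Smooth a sequence with a Gaussian kernel truncated at about 3 sigma.

    Weights are renormalized at the edges so the output keeps the input length.
    """
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for t in range(len(signal)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= t + k < len(signal):
                num += w * signal[t + k]
                den += w
        smoothed.append(num / den)
    return smoothed


# Toy data: the labels are the feature delayed by two frames.
feature = [0, 1, 0, 2, 0, 3, 0, 1]
labels = [0, 0] + feature[:-2]
shift, score = best_offset(feature, labels, max_shift=3)
```

On this toy data the search recovers the two-frame delay. A prediction that is perfectly correlated with the labels but biased, such as `ccc([2, 3, 4], [1, 2, 3])`, scores below 1, which is the property that distinguishes CCC from the Pearson coefficient as a selection objective.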

Cite This Study

He et al. (2015) studied affective dimension prediction, comparing a Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework against baseline features on the concordance correlation coefficient (CCC) for arousal and valence. The framework achieved a CCC of 0.747 for arousal and 0.609 for valence on the test set.

synapsesocial.com/papers/6a12bf308f1bac20a09e3988 https://doi.org/10.1145/2808196.2811641