With the explosive popularity of social media, more and more people, including those with depressive symptoms, have recently begun expressing their emotions online through vlogs, which makes video-based depression recognition increasingly important. Because video data contains rich acoustic and visual information, existing methods face two main challenges: (1) how to accurately mine depression-related features from massive amounts of data and (2) how to effectively fuse features from different modalities. In this paper, a multi-domain acoustical-visual information fusion network (MDAVIF) is designed to extract depression-related spatio-temporal features from image sequences and audio, and an adaptive feature interaction module is proposed to fuse these features. Combined with two autoencoders that retain information and prevent overfitting, the proposed method achieves state-of-the-art results on the D-vlog dataset, with a precision of 74.25% and an F1-score of 75.25%.
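To make the idea of adaptively mixing acoustic and visual features more concrete, the following is a minimal sketch of a generic gated fusion step. It is not the MDAVIF adaptive feature interaction module, whose details are not given in the abstract. The function name `gated_fusion`, the gate parameters `w_gate` and `b_gate`, and the toy feature shapes are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(acoustic, visual, w_gate, b_gate):
    """Blend two feature sequences with a per-element gate.

    Hypothetical illustration, not the paper's module.
    acoustic, visual: lists of T time steps, each a list of D floats.
    w_gate: 2D x D weight matrix (list of lists); b_gate: list of D biases.
    """
    fused = []
    for a_t, v_t in zip(acoustic, visual):
        joint = a_t + v_t  # concatenate both modalities: length 2D
        row = []
        for j, (a, v) in enumerate(zip(a_t, v_t)):
            # The gate for dimension j depends on both modalities at this step.
            z = sum(x * w_gate[i][j] for i, x in enumerate(joint)) + b_gate[j]
            g = sigmoid(z)
            # Convex combination: g weights acoustic, (1 - g) weights visual.
            row.append(g * a + (1.0 - g) * v)
        fused.append(row)
    return fused


# Toy data: T = 4 time steps, D = 3 feature dimensions (assumed sizes).
random.seed(0)
T, D = 4, 3
acoustic = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
w_gate = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(2 * D)]
b_gate = [0.0] * D

fused = gated_fusion(acoustic, visual, w_gate, b_gate)
```

Because the gate lies strictly between 0 and 1, each fused value falls between the corresponding acoustic and visual values. In a trained model the gate parameters would be learned, so the network could favor whichever modality is more informative at each time step.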
Ling et al. studied this question.