Multi-modal deception detection is a challenging yet important task with pivotal applications in fields such as business credibility assessment and multimedia anti-fraud. Previous methods either rely solely on spatial features or emphasize only temporal information within or across modalities, and may therefore overlook critical clues. Motivated by these observations, we propose a Spatio-Temporal Representation Disentanglement (STRD) framework for multi-modal deception detection, which uses a dual-encoder structure to learn separate spatial and temporal representations for each modality. Specifically, we use a pre-trained foundation model as the spatial encoder to extract spatial semantics, and design a lightweight network as the temporal encoder to capture dynamic temporal patterns. We then propose a Constrained Self-Attention Block (CSAB), in which the self-attention distribution of each head is treated as a spatial distribution and is constrained to attend to a specific local facial region. Furthermore, we present a Cross-modal Correlation Fusion Block (CCFB) that achieves temporal synchronization across modalities by measuring the correlations between visual and audio features. Extensive experiments show that STRD outperforms state-of-the-art methods on the challenging DOLOS, BOL, BgOL, and RLtrial benchmarks. In particular, STRD improves accuracy (ACC) by 2.12% and 1.88% over the previous best results on the DOLOS and BOL datasets, respectively. STRD also outperforms previous methods in cross-dataset testing, highlighting its superior generalization ability.
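To make the idea of a per-head constrained attention distribution concrete, the following is a minimal sketch of one possible reading. It is not the authors' CSAB. It assumes the constraint is a hard binary mask over image patches, whereas the actual method may use a soft constraint or a loss term instead. The function name `constrained_attention`, its arguments, and the four-patch example are all hypothetical.

```python
import math

def constrained_attention(scores, region_mask):
    """Softmax over one head's attention logits, restricted to an allowed region.

    scores      -- raw attention logits, one per patch (hypothetical input)
    region_mask -- booleans, True where the patch lies in this head's region
    Assumption: the constraint is a hard mask; this is an illustrative choice.
    """
    if not any(region_mask):
        raise ValueError("region_mask must allow at least one patch")
    # Subtract the largest allowed logit for numerical stability.
    peak = max(s for s, keep in zip(scores, region_mask) if keep)
    # Patches outside the region get zero weight before normalization.
    exps = [math.exp(s - peak) if keep else 0.0
            for s, keep in zip(scores, region_mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Four patches; this head is assumed to cover only patches 0 and 1.
weights = constrained_attention([2.0, 2.0, 5.0, -1.0],
                                [True, True, False, False])
```

In this sketch the large logit on patch 2 is ignored because it lies outside the head's assigned region, so the head's attention mass stays entirely on the two allowed patches.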
Shao et al. (Thu,) studied this question.