Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combinations of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows us to define new ways of optimizing the tradeoff between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with respect to state-of-the-art multimodal fusion models for both VQA and VRD tasks. Our code is available at: https://github.com/Cadene/block.bootstrap.pytorch.
Ben-younes et al. (Thu,) studied this question.
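To make the tradeoff described in the abstract concrete, the sketch below counts parameters for a full bilinear fusion tensor and for a block-term decomposition of that tensor. It is an illustration under stated assumptions, not the authors' implementation: the function names and the example dimensions are hypothetical, and the count assumes the standard block-term form in which each of the R blocks has a core of size L x M x N and three factor matrices projecting the two inputs and the output.

```python
# Hypothetical parameter-count sketch; names and sizes are illustrative only.

def full_bilinear_params(dx, dy, do):
    """Parameters of a full bilinear tensor mapping (dx, dy) inputs to do outputs."""
    return dx * dy * do


def block_term_params(dx, dy, do, num_blocks, l, m, n):
    """Parameters of a block-term decomposition with num_blocks blocks.

    Each block holds a core tensor of size l x m x n plus three factor
    matrices of sizes dx x l, dy x m and do x n.
    """
    core = num_blocks * l * m * n
    factors = num_blocks * (dx * l + dy * m + do * n)
    return core + factors


def cp_params(dx, dy, do, rank):
    """Rank-based (CP) special case: every block has a 1 x 1 x 1 core."""
    return block_term_params(dx, dy, do, rank, 1, 1, 1)


def tucker_params(dx, dy, do, l, m, n):
    """Mode-rank (Tucker) special case: a single block."""
    return block_term_params(dx, dy, do, 1, l, m, n)


if __name__ == "__main__":
    # Assumed example dimensions, not taken from the paper.
    dx, dy, do = 2048, 2048, 1000
    print("full bilinear:", full_bilinear_params(dx, dy, do))
    print("block-term   :", block_term_params(dx, dy, do, 20, 80, 80, 50))
    print("CP           :", cp_params(dx, dy, do, 1600))
    print("Tucker       :", tucker_params(dx, dy, do, 1600, 1600, 1000))
```

Setting the number of blocks to 1 recovers a Tucker-style count, and setting every block size to 1 recovers a CP-style count, which is the sense in which block-term ranks generalize both rank and mode ranks. Varying the number of blocks against the block sizes is one way to explore the balance between expressiveness and complexity.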