Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
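The second step can be illustrated with a small sketch. The abstract does not spell out the fusion procedure, so the code below assumes a generic stacked (late) fusion: one scorer is trained per modality, and a second-level model learns how to combine the per-modality scores. It is not the paper's super-kernel method. A nearest-centroid scorer and a linear logistic regression stand in for whatever classifiers the authors use, the modalities are fixed by hand rather than discovered (step one is skipped), and all names and toy data (`ModalityScorer`, `fit_fusion`, `visual`, `audio`) are hypothetical.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class ModalityScorer:
    """Hypothetical per-modality classifier (nearest centroid)."""

    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def score(self, x):
        # Positive when x lies closer to the positive-class centroid.
        return sq_dist(x, self.neg) - sq_dist(x, self.pos)

def fit_fusion(S, y, lr=0.1, epochs=500):
    """Second-level logistic regression over per-modality scores (SGD)."""
    w, b = [0.0] * len(S[0]), 0.0
    for _ in range(epochs):
        for s, t in zip(S, y):
            g = sigmoid(dot(w, s) + b) - t  # gradient of the log-loss
            w = [wi - lr * g * si for wi, si in zip(w, s)]
            b -= lr * g
    return w, b

def predict(scorers, w, b, sample):
    """sample holds one feature vector per modality, in scorer order."""
    s = [sc.score(x) for sc, x in zip(scorers, sample)]
    return 1 if sigmoid(dot(w, s) + b) > 0.5 else 0

# Toy data (made up for illustration): two modalities, six clips.
visual = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
audio = [[0.1], [0.0], [0.2], [0.9], [1.0], [0.8]]
labels = [0, 0, 0, 1, 1, 1]

# Level 1: one scorer per modality.
scorers = [ModalityScorer().fit(visual, labels), ModalityScorer().fit(audio, labels)]

# Level 2: learn how to weight the per-modality scores.
# For brevity the fusion model sees training-set scores; a careful
# setup would use held-out scores to limit overfitting.
S = [[sc.score(x) for sc, x in zip(scorers, pair)] for pair in zip(visual, audio)]
w, b = fit_fusion(S, labels)

print(predict(scorers, w, b, ([0.95, 1.0], [0.85])))
```

Because the second-level model sees only one score per modality, its input dimension equals the number of modalities rather than the number of raw features, which is one way to read the tradeoff the abstract draws between the curse of dimensionality and fusion-model complexity.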
Yi Wu
Heilongjiang University
Edward Yi Chang
National Yang Ming Chiao Tung University
Kevin Chen–Chuan Chang
University of Illinois Urbana-Champaign
University of Illinois Urbana-Champaign
University of California, Santa Barbara
IBM (United States)
Wu et al. studied this question.
synapsesocial.com/papers/6a025512e7b2554f3af6020f — DOI: https://doi.org/10.1145/1027527.1027665