October 10, 2004

Optimal multimodal fusion for multimedia data analysis

Yi Wu (Heilongjiang University), Edward Yi Chang (National Yang Ming Chiao Tung University), Kevin Chen–Chuan Chang (University of Illinois Urbana-Champaign)

Abstract

Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
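
The abstract does not spell out the algorithms behind either step, so the following is only a minimal sketch of the two-step shape under stated assumptions, not the authors' method. Step 1 is approximated by grouping feature columns whose absolute correlation exceeds a threshold (low correlation is a weaker condition than statistical independence). Step 2 is approximated by training one simple scorer per group and then a logistic-regression combiner over those scores, standing in for the kernel classifiers and super-kernel fusion named above. All function names, the toy data, and the parameter values are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def group_features(X, threshold=0.5):
    """Step 1 proxy (assumption): merge columns whose |correlation| >= threshold."""
    d = len(X[0])
    cols = [[row[j] for row in X] for j in range(d)]
    parent = list(range(d))  # simple union-find over feature indices

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(d):
        for j in range(i + 1, d):
            if abs(pearson(cols[i], cols[j])) >= threshold:
                parent[find(j)] = find(i)
    groups = {}
    for j in range(d):
        groups.setdefault(find(j), []).append(j)
    return sorted(groups.values())


def centroid_scorer(X, y, idx):
    """Per-modality scorer: signed distance along the negative-to-positive centroid axis.

    Assumes both classes occur in y. A stand-in for a per-modality kernel classifier.
    """
    def centroid(label):
        rows = [r for r, t in zip(X, y) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in idx]

    pos, neg = centroid(1), centroid(0)
    axis = [p - q for p, q in zip(pos, neg)]
    norm = math.sqrt(sum(a * a for a in axis)) or 1.0
    mid = [(p + q) / 2 for p, q in zip(pos, neg)]
    return lambda row: sum((row[j] - m) * a for j, m, a in zip(idx, mid, axis)) / norm


def train_fusion(S, y, lr=0.5, epochs=200):
    """Step 2 stand-in: logistic regression over per-modality scores (batch gradient ascent)."""
    k, n = len(S[0]), len(S)
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * k, 0.0
        for s, t in zip(S, y):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            err = t - 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            gw = [g + err * si for g, si in zip(gw, s)]
            gb += err
        w = [wi + lr * g / n for wi, g in zip(w, gw)]
        b += lr * gb / n
    return w, b


def accuracy(scores, y):
    """Fraction of samples where (score > 0) matches the binary label."""
    return sum((s > 0) == (t == 1) for s, t in zip(scores, y)) / len(y)


def make_toy(n=200, seed=0):
    """Toy data: two independent latents, each observed twice; the label needs both."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([a, a + rng.gauss(0, 0.1), b, b + rng.gauss(0, 0.1)])
        y.append(1 if a + b > 0 else 0)
    return X, y


X, y = make_toy()
groups = group_features(X)                      # step 1: candidate modalities
scorers = [centroid_scorer(X, y, g) for g in groups]
S = [[f(row) for f in scorers] for row in X]    # one score per modality
w, b = train_fusion(S, y)                       # step 2: fuse the scores
single = [accuracy([s[i] for s in S], y) for i in range(len(groups))]
fused = accuracy([sum(wi * si for wi, si in zip(w, s)) + b for s in S], y)
print(groups, single, fused)
```

On this toy data the label depends on both latent sources, so each modality's scorer alone is limited, while the fused score can use both. Accuracy here is measured on the training set purely for illustration; it says nothing about the results reported in the paper.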
