What question did this study set out to answer?

The aim is to develop an AI-driven framework for detecting fraud in short-video environments using multimodal sensor data.

March 2, 2026 | Open Access

An AI-Driven Multimodal Sensor Fusion Framework for Fraud Perception in Short-Video and Live-Streaming Platforms

Key Points

Proposed an AI-driven multimodal sensor perception framework for temporal fraud detection in short-video environments.
Designed a multimodal temporal alignment module to synchronize sensor signals.
Constructed a shared temporal encoding network for evolution-aware representations.
Introduced a cross-modal temporal attention fusion mechanism to weight sensor contributions dynamically.
Developed a fraud evolution modeling and risk prediction module for assessing fraud intensity.
Achieved an overall accuracy of 0.941 and an AUC of 0.956 in fraud detection.
Maintained stable early-stage performance using only the first 30% of video content, with precision of 0.812, recall of 0.704, and F1 score of 0.754.
Outperformed text-based, vision-based, temporal, and conventional multimodal baselines.
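
The alignment and attention-fusion steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero-order-hold resampling, the scalar features, and the hand-set relevance scores are all assumptions chosen to keep the example small. It only shows the general idea of placing signals with different sampling rates on a shared time grid and then weighting each modality per time step.

```python
import math

def align_to_grid(timestamps, values, grid):
    """Resample one modality onto a shared time grid by carrying the most
    recent observation forward (zero-order hold). Grid points that precede
    the first observation are back-filled with that first value.
    Hypothetical helper; the paper's alignment module may differ."""
    aligned, i, last = [], 0, values[0]
    for t in grid:
        # Consume every observation recorded at or before grid time t.
        while i < len(timestamps) and timestamps[i] <= t:
            last = values[i]
            i += 1
        aligned.append(last)
    return aligned

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_step(features, relevance):
    """Fuse per-modality scalar features at one time step using attention
    weights. In a trained model the relevance scores would be learned;
    here they are supplied by hand (an assumption for illustration)."""
    weights = softmax(relevance)
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights

# Hypothetical toy signals: audio sampled every 0.5 s, comments arriving irregularly.
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
audio = align_to_grid([0.0, 0.5, 1.0, 1.5, 2.0], [0.1, 0.2, 0.2, 0.6, 0.9], grid)
comments = align_to_grid([0.0, 1.2], [0.0, 1.0], grid)

# Hand-set relevance that shifts toward the comment stream over time.
relevance = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
fused = [fuse_step([a, c], r)[0] for a, c, r in zip(audio, comments, relevance)]
print(fused)
```

Zero-order hold is used here only because it never looks ahead in time, which suits an early-warning setting; interpolation or learned alignment would be equally plausible choices.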

Abstract

With the rapid proliferation of short-video platforms and live-streaming commerce ecosystems, marketing activities are increasingly manifested through complex multimodal sensing signals. These heterogeneous sensor data streams exhibit strong temporal dependency, high cross-modal coupling, and progressive evolutionary characteristics, making early-stage fraud perception particularly challenging for conventional unimodal or static analytical paradigms. Existing approaches often fail to effectively capture weak anomalous cues emerging across multimodal channels during the initial stages of fraudulent campaigns. To address these limitations, an artificial intelligence-driven multimodal sensor perception framework is proposed for temporal fraud detection in short-video environments. A multimodal temporal alignment module is first designed to synchronize heterogeneous sensor signals with inconsistent sampling granularities. Subsequently, a shared temporal encoding network is constructed to learn evolution-aware representations across multimodal sensor sequences. On this basis, a cross-modal temporal attention fusion mechanism is introduced to dynamically weight sensor contributions at different behavioral stages. Finally, a fraud evolution modeling and early risk prediction module is developed to characterize the progressive intensification of fraudulent activities and to enable risk assessment under incomplete temporal observations. Extensive experiments conducted on real-world datasets collected from multiple mainstream short-video platforms demonstrate the effectiveness of the proposed AI-driven sensing framework. The model achieves an overall accuracy of 0.941, precision of 0.865, recall of 0.812, and F1 score of 0.838, with the AUC further reaching 0.956, significantly outperforming text-based, vision-based, temporal, and conventional multimodal baselines. 
In early-stage detection scenarios utilizing only the first 30% of video content, the framework maintains stable performance advantages, achieving a precision of 0.812, recall of 0.704, and F1 score of 0.754, validating its capability for proactive fraud warning.
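
The F1 scores reported above can be cross-checked against the stated precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The short sketch below does only that arithmetic; the function name is an illustrative choice and nothing here reproduces the study's evaluation pipeline.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values taken from the abstract.
overall_f1 = f1_score(0.865, 0.812)  # full-video setting
early_f1 = f1_score(0.812, 0.704)    # first 30% of video content

print(round(overall_f1, 3), round(early_f1, 3))
```

Rounded to three decimals, both results agree with the reported F1 scores of 0.838 and 0.754.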
