What question did this study set out to answer?

This research aims to enhance digital animation quality by integrating visual, textual, and auditory data during the content generation process.

March 31, 2026 | Open Access

Research on digital animation content generation technology for local cultural heritage using a multimodal data fusion method

Key Points

Used a Tri-Modal Transformer with Dynamic Attention Fusion to process multiple data modalities.
Employed a Vision Transformer, Multilingual BERT, and Tacotron-2 for feature extraction.
Implemented a ControlNet-guided GAN to maintain structure in animations.
Conducted experiments on the SILKNOW dataset to evaluate effectiveness.
Achieved an accuracy of 0.94 and an F1-score of 0.92.
Recorded a Mean Opinion Score of 4.698 and a cultural authenticity rating of 4.7 out of 5.
Confirmed that dynamic multimodal fusion significantly enhances storytelling and heritage preservation.

Abstract

Current systems for animating digital cultural heritage often rely on unimodal processing or static multimodal fusion, which results in fragmented narratives and poor cultural authenticity. To address this limitation, this study proposes a Tri-Modal Transformer with Dynamic Attention Fusion (DAF) that produces culturally faithful digital animation content by modeling the visual, textual, and auditory modalities concurrently. A Vision Transformer, Multilingual BERT, and Tacotron-2 are combined to extract fine-grained features from the respective modalities, while ControlNet-guided GAN-based motion synthesis maintains structural integrity throughout animation generation. In experiments on the SILKNOW dataset, the proposed framework improves performance, reaching an accuracy of 0.94, an F1-score of 0.92, a Mean Opinion Score of 4.698, and a cultural authenticity rating of 4.7/5, which the authors take as confirmation of its effectiveness. The findings further indicate that dynamic multimodal fusion and structure-aware animation generation substantially benefit digital storytelling and cultural heritage preservation.
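To make the contrast between static and dynamic fusion concrete, the following is a minimal sketch of input-dependent attention weighting over three modality feature vectors. It is not the authors' DAF implementation, which the abstract does not specify. The function name `dynamic_attention_fusion`, the `query` vector, the scaled dot-product scoring, and all numeric values are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_attention_fusion(features, query):
    """Fuse per-modality feature vectors with input-dependent weights.

    features: dict mapping a modality name to a feature vector (list of floats),
              all with the same length as `query`.
    query:    hypothetical context vector used to score each modality.
    Returns (weights by modality, fused feature vector).
    """
    dim = len(query)
    names = list(features)
    # Scaled dot-product score between the query and each modality's features.
    scores = [
        sum(q * x for q, x in zip(query, features[name])) / math.sqrt(dim)
        for name in names
    ]
    # Unlike static fusion, the weights are recomputed for every input.
    attn = softmax(scores)
    fused = [
        sum(w * features[name][i] for w, name in zip(attn, names))
        for i in range(dim)
    ]
    return dict(zip(names, attn)), fused


# Illustrative values only; real features would come from the three encoders.
example_features = {
    "visual": [0.9, 0.1, 0.8, 0.2],
    "textual": [0.2, 0.7, 0.1, 0.6],
    "audio": [0.4, 0.4, 0.5, 0.3],
}
example_query = [1.0, 0.0, 1.0, 0.0]

weights, fused = dynamic_attention_fusion(example_features, example_query)
print(weights)
print(fused)
```

A static scheme would instead fix the three weights in advance (for example, one third each) regardless of the input. In this sketch, the modality whose features align most closely with the query receives the largest weight.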

Cite This Study

Jingze Li (Sun,) studied this question.

synapsesocial.com/papers/69cb64b0e6a8c024954b8b80 https://doi.org/10.1007/s44163-026-01158-7
