Current systems for animating digital cultural heritage often rely on unimodal processing or static multimodal fusion, which produces fragmented narratives and weak cultural authenticity. To address this limitation, this study proposes a Tri-Modal Transformer with Dynamic Attention Fusion (DAF) that jointly models the visual, textual, and auditory modalities to generate culturally faithful digital animation. A Vision Transformer, Multilingual BERT, and Tacotron-2 extract fine-grained features from the respective modalities, while ControlNet-guided, GAN-based motion synthesis preserves structural integrity throughout animation generation. In experiments on the SILKNOW dataset, the proposed framework achieves an accuracy of 0.94, an F1-score of 0.92, a Mean Opinion Score of 4.698, and a cultural authenticity rating of 4.7/5, supporting its effectiveness. These findings suggest that dynamic multimodal fusion and structure-aware animation generation can substantially benefit digital storytelling and cultural heritage preservation.
Jingze Li (Sun,) studied this question.
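The abstract contrasts static multimodal fusion with dynamic attention fusion but does not spell out the mechanism. The following is a minimal sketch of one common way to realize that idea, not the paper's actual DAF module: each sample's visual, textual, and auditory embeddings are scored, the scores are normalized with a softmax across the three modalities, and the embeddings are combined with those per-sample weights. The function name `dynamic_attention_fusion`, the scoring vector `w_score`, and the random stand-in embeddings are all assumptions made for illustration; in a real system the embeddings would come from the modality encoders and `w_score` would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_attention_fusion(visual, text, audio, w_score):
    """Fuse three (batch, d) modality embeddings with per-sample weights.

    w_score is a hypothetical (d,) scoring vector (learned in practice).
    Returns the fused (batch, d) embedding and the (batch, 3) weights.
    """
    stacked = np.stack([visual, text, audio], axis=1)        # (batch, 3, d)
    scores = stacked @ w_score / np.sqrt(stacked.shape[-1])  # (batch, 3)
    weights = softmax(scores, axis=1)                        # rows sum to 1
    fused = (weights[..., None] * stacked).sum(axis=1)       # (batch, d)
    return fused, weights


# Random stand-ins for encoder outputs (assumption: a shared dimension d).
rng = np.random.default_rng(0)
batch, d = 4, 8
visual = rng.normal(size=(batch, d))
text = rng.normal(size=(batch, d))
audio = rng.normal(size=(batch, d))
w_score = rng.normal(size=d)

fused, weights = dynamic_attention_fusion(visual, text, audio, w_score)
print(weights.round(3))  # one row of three modality weights per sample
```

Because the weights depend on the input embeddings, each sample can emphasize a different modality. Setting `w_score` to zeros makes every weight 1/3, which recovers a static average and serves as the baseline against which the dynamic weighting is compared.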