Folk music teaching emphasizes both cultural preservation and emotional expression, which makes it difficult to evaluate. Conventional single-modality methods, which rely only on audio or textual feedback, often fail to capture the interplay between performance accuracy, tonal quality, and student engagement. To overcome these limitations, this study proposes a Hybrid Multimodal Sentiment-Tone Analysis (HMSTA) framework that integrates speech, music, and gesture analysis into a holistic evaluation. In preprocessing, wavelet filtering reduces noise, and music notes are normalized and grouped into note types to give a consistent tonal representation. Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio signals serve as input features for Convolutional Neural Networks (CNNs) that classify emotions and analyze tonal patterns. For music tone evaluation, MFCC-based features are compared against reference notes to assess pitch accuracy and rhythm stability. In parallel, gesture engagement is measured with CNN-based pose estimation, which captures expressive movement during teaching and learning sessions. A multimodal attention-based fusion model then combines the speech, music, and gesture features to provide synchronized, real-time assessments of both teacher delivery and student response. In experimental validation on a multimodal folk music teaching dataset of 200 sessions, HMSTA averages 91.7% in evaluation scores, 89.4% in emotion recognition, 93.1% in pitch accuracy, 90.6% in gesture analysis, and 91.2% in cultural authenticity, while reducing processing time to 10.2 s per session. These results indicate that HMSTA offers a practical, data-driven framework for curriculum improvement and cultural heritage preservation.
Aijuan Zhang studied this question.
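The abstract does not specify the form of the attention-based fusion model, so the following is only a minimal sketch of one common choice: scaled dot-product attention over one embedding per modality. The function names, the 4-dimensional embeddings, and the `query` vector (standing in for a learned parameter) are all hypothetical and are not taken from HMSTA itself.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    features: Dict[str, List[float]],
    query: List[float],
) -> Tuple[List[float], Dict[str, float]]:
    """Fuse per-modality vectors with scaled dot-product attention.

    `features` maps a modality name to a vector; every vector must have
    the same dimension as `query`. Returns the fused vector and the
    attention weight assigned to each modality.
    """
    names = list(features)
    dim = len(query)
    if any(len(features[n]) != dim for n in names):
        raise ValueError("all modality vectors must match the query dimension")

    # Relevance score of each modality: scaled dot product with the query.
    scale = math.sqrt(dim)
    scores = [
        sum(q * x for q, x in zip(query, features[n])) / scale for n in names
    ]
    weights = softmax(scores)

    # Fused representation: attention-weighted sum of the modality vectors.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Hypothetical embeddings for one time window of a teaching session.
window = {
    "speech": [0.2, 0.7, 0.1, 0.4],
    "music": [0.9, 0.3, 0.5, 0.8],
    "gesture": [0.1, 0.2, 0.6, 0.3],
}
query = [1.0, 0.5, 0.0, 1.0]  # assumption: stands in for a learned parameter

fused, weights = attention_fuse(window, query)
```

In this sketch the weights sum to one, so the fused vector stays on the same scale as its inputs, and inspecting `weights` shows which modality dominated a given window. A trained model would learn the query (and typically key and value projections) rather than fix them by hand.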