What question did this study set out to answer?

To evaluate the effectiveness of folk music teaching through a multimodal machine learning framework.

February 26, 2026 · Open Access

Comprehensive evaluation of folk music teaching effects based on multimodal machine learning

Key points

Developed a Hybrid Multimodal Sentiment-Tone Analysis (HMSTA) framework that integrates speech, music, and gesture analysis.
Applied wavelet filtering for noise reduction and normalized music notes for consistent tonal representation.
Extracted Mel-Frequency Cepstral Coefficients (MFCCs) as input features for Convolutional Neural Networks (CNNs) that classify emotions and tonal patterns.
Measured gesture engagement using CNN-based pose estimation.
Validated the framework experimentally on a dataset of 200 multimodal folk music teaching sessions.
Achieved an average evaluation score of 91.7% across metrics.
Reached 89.4% accuracy in emotion recognition.
Reached 93.1% pitch accuracy and 90.6% accuracy in gesture analysis.
Scored 91.2% in cultural authenticity.
Reduced processing time to 10.2 seconds per session.
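The wavelet noise-reduction step can be illustrated with a short sketch. The source does not state which wavelet, decomposition depth, or thresholding rule HMSTA uses, so this example assumes a Haar wavelet, three decomposition levels, and soft thresholding with the universal threshold (noise level estimated from the median absolute finest-scale detail coefficient). The function names are hypothetical, and only the Python standard library is used.

```python
import math
import statistics

# Illustrative Haar wavelet denoising. The wavelet family, depth, and threshold
# rule are assumptions for this sketch, not settings reported for HMSTA.


def haar_forward(signal):
    """One level of the orthonormal Haar transform -> (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even at every level")
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one Haar level by rebuilding each original sample pair."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become 0."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def wavelet_denoise(signal, levels=3):
    """Decompose, soft-threshold all detail coefficients, then reconstruct.

    Threshold: sigma * sqrt(2 ln N), where sigma is the median absolute
    finest-level detail coefficient divided by 0.6745.
    """
    n = len(signal)
    if levels < 1 or n < 2 or n % (2 ** levels):
        raise ValueError("length must be a nonzero multiple of 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_forward(approx)
        details.append(d)  # details[0] is the finest scale
    sigma = statistics.median(abs(c) for c in details[0]) / 0.6745
    t = sigma * math.sqrt(2 * math.log(n))
    for d in reversed(details):  # rebuild from the coarsest level up
        approx = haar_inverse(approx, [soft_threshold(c, t) for c in d])
    return approx
```

Soft thresholding shrinks every surviving detail coefficient instead of keeping it unchanged, which tends to avoid the abrupt artifacts of hard thresholding. A production pipeline would more likely rely on a dedicated library such as PyWavelets, which supports many more wavelet families.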

Abstract

Folk music teaching emphasizes both cultural preservation and emotional expression, which makes its evaluation complex. Conventional single-modality methods, which rely only on audio or textual feedback, often fail to capture the interplay between performance accuracy, tonal quality, and student engagement. To overcome these limitations, this study proposes a Hybrid Multimodal Sentiment-Tone Analysis (HMSTA) framework that integrates speech, music, and gesture analysis for a holistic evaluation. The framework applies wavelet filtering for noise reduction, and it normalizes music notes and categorizes them by type for consistent tonal representation. Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio signals serve as inputs to Convolutional Neural Networks (CNNs) that classify emotions and analyze tonal patterns. For music tone evaluation, MFCC-based features are compared against reference notes to assess pitch accuracy and rhythm stability. In parallel, CNN-based pose estimation measures gesture engagement by capturing expressive movement during teaching and learning sessions. An attention-based multimodal fusion model then integrates these features to provide synchronized, real-time assessments of both teacher delivery and student response. Experimental validation on a multimodal folk music teaching dataset of 200 sessions shows that HMSTA delivers superior accuracy, with an average evaluation score of 91.7% and accuracies of 89.4% for emotion recognition, 93.1% for pitch, 90.6% for gesture analysis, and 91.2% for cultural authenticity, while reducing processing time to 10.2 s per session. These results indicate that HMSTA offers a practical, data-driven framework for curriculum improvement and cultural heritage preservation.
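To make the scoring and fusion stages described in the abstract more concrete, here is a minimal standard-library Python sketch. It is not the HMSTA implementation: it compares detected fundamental frequencies (rather than MFCC-based features) against reference notes expressed as MIDI numbers, and the names `freq_to_midi`, `pitch_accuracy`, `rhythm_stability`, and `attention_fuse`, the ±50-cent tolerance, and the inter-onset-interval stability measure are all hypothetical choices for illustration. The fusion function is plain scaled dot-product attention over per-modality vectors; the learned query, key, and value projections a trained model would use are left out.

```python
import math

# Hypothetical illustration of note scoring and attention fusion;
# not the HMSTA authors' code.


def freq_to_midi(freq_hz):
    """Normalize a frequency to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 69 + 12 * math.log2(freq_hz / 440.0)


def pitch_accuracy(detected_hz, reference_midi, tolerance_cents=50.0):
    """Fraction of notes whose detected pitch is within +/- tolerance of its reference."""
    if len(detected_hz) != len(reference_midi) or not reference_midi:
        raise ValueError("need one detected pitch per reference note")
    hits = 0
    for freq, ref in zip(detected_hz, reference_midi):
        cents_off = 100.0 * (freq_to_midi(freq) - ref)  # 1 semitone = 100 cents
        if abs(cents_off) <= tolerance_cents:
            hits += 1
    return hits / len(reference_midi)


def rhythm_stability(onsets_s, reference_onsets_s):
    """1 minus the mean relative inter-onset-interval error, clipped at 0."""
    if len(onsets_s) != len(reference_onsets_s) or len(onsets_s) < 2:
        raise ValueError("need matching onset lists with at least two onsets")
    perf = [b - a for a, b in zip(onsets_s, onsets_s[1:])]
    ref = [b - a for a, b in zip(reference_onsets_s, reference_onsets_s[1:])]
    err = sum(abs(p - r) / r for p, r in zip(perf, ref)) / len(ref)
    return max(0.0, 1.0 - err)


def attention_fuse(features, query):
    """Scaled dot-product attention over modality vectors (keys = values).

    features: dict mapping modality name -> feature vector of len(query).
    Returns the fused vector and the softmax weight of each modality.
    """
    d = len(query)
    names = list(features)
    if any(len(features[m]) != d for m in names):
        raise ValueError("every feature vector must match the query length")
    scores = [sum(q * x for q, x in zip(query, features[m])) / math.sqrt(d) for m in names]
    peak = max(scores)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = {m: e / total for m, e in zip(names, exps)}
    fused = [sum(weights[m] * features[m][i] for m in names) for i in range(d)]
    return fused, weights
```

In a trained fusion model the query and projections would be learned, so the modality weights could vary from session to session; here they are fixed inputs so the arithmetic stays easy to follow.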
