The article substantiates the thesis that, in modern technological society, the traditional linguistic dictionary has acquired a new systemic variant: the dataset. Although dictionaries and datasets share a common “object-key” logic, they also differ in certain respects, which we illustrate with a multimodal emotion dataset designed for studying emotional speech in Russian and for assessing how well computer models detect emotions automatically across modalities. The article aims to demonstrate the potential of datasets as a new form of systematizing and manifesting linguists’ expert knowledge in the digital era. The corpus comprises 173 minutes of video recordings of emotional narratives collected with the autobiographical MIP method from eleven women aged 19 to 26. The recordings were divided into 909 fragments. Each fragment was rated on six emotion scales (joy, sadness, anger, surprise, fear, disgust), each ranging from 0 to 5, by six annotators (three worked with one half of the sample, three with the other half) in four formats: multimodal, audio only, text only, and video only. The key findings from the dataset analysis are as follows. (1) Across modalities, inter-annotator agreement was highest for text annotations and full multimodal annotations (α = 0.57) and lowest for video-only annotations (α = 0.30). (2) Across emotional classes, annotator agreement was highest for neutral texts and relatively high for joyful and sad texts, while mixed emotions were recognized least consistently. (3) Joy and surprise are recognized primarily when fragments are presented in audio format; sadness, fear, and disgust are better identified in the audio and text modalities; anger is recognized most accurately in the text modality alone. (4) Presenting fragments in video format reduces recognition accuracy for all emotions, least for joy and most for fear. The dataset has also proven effective as a tool for evaluating eight computer emotion recognition models, including text, audio, and multimodal models. Text-based models aligned most closely with human annotations, while video-based models performed worst. Despite some limitations related to data collection and speech segmentation, the dataset represents a valuable linguistic resource for emotion recognition research. The authors declare no conflict of interest.
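The agreement scores reported above (0.57 and 0.30) are not tied to a named coefficient in this abstract. As a minimal sketch, assuming they are Krippendorff's alpha computed with an interval distance over the 0 to 5 ratings, the calculation for one emotion scale in one presentation format could look like the following. The function name `krippendorff_alpha_interval` and the toy ratings are hypothetical and do not come from the dataset.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval distance (a - b) ** 2.

    `units` is a list of lists: each inner list holds the ratings that one
    fragment received from its annotators (hypothetical layout, assumed here).
    """
    # Only fragments rated by at least two annotators contribute pairs.
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences within each fragment,
    # weighted by 1 / (m_u - 1), averaged over all pairable ratings.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences over all ordered pairs
    # of ratings, ignoring which fragment they belong to.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("ratings show no variation; alpha is undefined")

    return 1 - d_o / d_e


# Toy example (invented ratings): three annotators, four fragments, one scale.
toy_joy_ratings = [[5, 4, 5], [0, 0, 1], [2, 3, 2], [0, 1, 0]]
print(round(krippendorff_alpha_interval(toy_joy_ratings), 3))
```

Under this assumption, the function would be applied separately to each annotator group, emotion scale, and presentation format, since each half of the sample was rated by a different trio of annotators.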
Kolmogorova et al. studied this question.