Aspect extraction aims to identify the key informational elements in a text. Although this task has been explored extensively within aspect-based sentiment analysis (ABSA), aspect extraction in scientific texts remains underexplored. This paper introduces a new multi-domain corpus of Russian and Kazakh scientific texts annotated for aspect-based information extraction. The dataset extends existing resources for named entity recognition and relation extraction, facilitates research in cross-lingual transfer, and establishes initial benchmarks for aspect extraction in low-resource linguistic contexts. The corpus includes 412 abstracts in Russian and Kazakh, annotated with 2,129 and 2,027 aspects respectively across seven categories: AIM, METHOD, MATERIAL, TASK, TOOL, RESULT, and USAGE. The study analyses the distribution of aspects across four scientific domains (IT, linguistics, medicine, and psychology) and reports experiments with several classes of methods: classical deep learning, a contextual Transformer encoder (mBERT), and a new multilingual XLM-RoBERTa + CRF architecture. The experimental results show that multilingual models are effective for zero-shot aspect extraction between Russian and Kazakh, even in low-resource conditions. Future research will focus on optimising tokenisation and exploring semi-supervised approaches to further improve model performance. The resulting models and dataset are available at https://github.com/nikitashvarts/scimdix_aspect_extraction and can support downstream applications such as automatic metadata generation, construction of scientific knowledge graphs, and domain-specific information retrieval.
Shvarts et al. studied this question.