What question did this study set out to answer?

The research aims to develop and analyze a corpus for aspect extraction from multi-domain scientific texts in Russian and Kazakh.

March 23, 2026 | Open Access

SciMDIX: A dataset for aspect extraction from multi-domain scientific documents in Kazakh and Russian

Key Points

The research aims to develop and analyze a corpus for aspect extraction from multi-domain scientific texts in Russian and Kazakh.
Created a dataset of 412 abstracts annotated for aspect-based information extraction in Kazakh and Russian.
Analyzed aspect distribution across four scientific domains: IT, linguistics, medicine, and psychology.
Conducted experiments using classical deep learning, mBERT, and a new multilingual XLM-RoBERTa + CRF architecture.
Achieved effective zero-shot aspect extraction between Russian and Kazakh using multilingual models.
Annotated datasets include 2,129 aspects in Russian and 2,027 aspects in Kazakh across seven categories.

Abstract

The objective of aspect extraction is to identify the key informational elements in a text. Although aspect-based sentiment analysis (ABSA) has explored this field extensively, aspect extraction in scientific texts remains underexplored. The present paper introduces a new multi-domain corpus of Russian and Kazakh scientific texts, annotated for aspect-based information extraction. This dataset expands existing resources for named entity recognition and relation extraction. It facilitates research in cross-lingual transfer and establishes initial benchmarks for aspect extraction in low-resource linguistic contexts. The presented corpus includes 412 abstracts in Russian and Kazakh, annotated with 2,129 and 2,027 aspects respectively across seven categories: AIM, METHOD, MATERIAL, TASK, TOOL, RESULT, and USAGE. The present study analyses the distribution of aspects across four scientific domains (IT, linguistics, medicine, and psychology) and conducts experiments using multiple methodological classes, including classical deep learning, a contextual Transformer encoder (mBERT), and a new multilingual XLM-RoBERTa + CRF architecture. The experimental results demonstrate the efficacy of multilingual models in performing zero-shot aspect extraction between Russian and Kazakh, even in low-resource conditions. Future research will focus on the optimisation of tokenisation and the exploration of semi-supervised approaches to further enhance model performance. The resulting models and dataset are available at https://github.com/nikitashvarts/scimdix_aspect_extraction and can support downstream applications such as automatic metadata generation, construction of scientific knowledge graphs, and domain-specific information retrieval.
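A sequence labeller with a CRF output layer typically predicts one tag per token, so span annotations over the seven categories must first be turned into token-level labels. The sketch below shows one common way to do that. It assumes a BIO tagging scheme, which the abstract does not confirm. The helper names `spans_to_bio` and `bio_to_spans`, the `(start, end, category)` span format, and the example sentence are all hypothetical and are not taken from the SciMDIX repository.

```python
from typing import List, Tuple

# The seven aspect categories named in the abstract.
ASPECT_CATEGORIES = ["AIM", "METHOD", "MATERIAL", "TASK", "TOOL", "RESULT", "USAGE"]

# Assumed BIO label inventory: "O" plus a B- and an I- tag per category.
LABELS = ["O"] + [f"{p}-{c}" for c in ASPECT_CATEGORIES for p in ("B", "I")]

# Hypothetical span format: (start token index, end index exclusive, category).
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping token spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, category in spans:
        if category not in ASPECT_CATEGORIES:
            raise ValueError(f"unknown aspect category: {category}")
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        tags[start] = f"B-{category}"
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover (start, end, category) spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, category = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and category == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, category))
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
    return spans


# Illustrative sentence, not drawn from the corpus.
tokens = ["We", "train", "XLM-RoBERTa", "for", "aspect", "extraction"]
gold_spans = [(2, 3, "TOOL"), (4, 6, "TASK")]
tags = spans_to_bio(len(tokens), gold_spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this assumed scheme the label set has 15 entries, and converting predicted tags back with `bio_to_spans` gives spans that can be compared against gold annotations when scoring a model, including in a zero-shot setting where training and test languages differ.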
