What question did this study set out to answer?

The research aims to evaluate large language models' ability to generate encyclopedic articles, specifically in Russian.

March 5, 2026

RuWikiBench: Evaluating Large Language Models Through Replication of Encyclopedia Articles

Key Points

The research aims to evaluate large language models' ability to generate encyclopedic articles, specifically in Russian.
The authors developed RuWikiBench, an open benchmark based on Ruwiki.
The benchmark comprises three tasks: source selection, article structuring, and section generation.
Popular open-source large language models were tested, including under ideal conditions such as a perfect source retrieval system.
Even the best-performing models did not always follow expert logic in article composition.
Even with perfect source retrieval, the models could not reproduce the reference table of contents.
Quality of section generation was largely independent of model size.

Abstract

With growing interest in using large language models (LLMs) as tools for generating scientific texts, evaluating their ability to produce encyclopedic content is becoming increasingly relevant. For Russian-language materials, however, this question has not been sufficiently studied, and existing benchmarks do not cover key aspects of analytical work with sources. This article presents RuWikiBench, an open benchmark based on Ruwiki for evaluating the ability of LLMs to reproduce Wikipedia-style articles. The benchmark is built around three tasks: selection of relevant sources, article structuring, and section generation. Tests of popular open-source LLMs show that, even under ideal conditions, the best models do not always follow the expert logic of composing encyclopedic content: even with a perfect source retrieval system, the models cannot reproduce the reference table of contents, and the quality of section generation shows almost no dependence on the number of parameters.
