What type of study is this?

This is a quantitative study.

October 8, 2025

Open Access

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Key Points

Current LLMs struggle with native multilingual reasoning: none scores above 50% on MultiNRC.
The 14 leading LLMs evaluated show distinct strengths and weaknesses across linguistic, cultural, and logical reasoning tasks.
Most models score substantially higher (+10%) on math reasoning in English than in the original languages, indicating persistent challenges with culturally grounded knowledge.
The benchmark includes more than 1,000 reasoning questions written by native speakers in French, Spanish, and Chinese, unlike earlier translation-based benchmarks.

Abstract

Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs' multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems with context in English language/cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistic and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay [...] (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) most models perform substantially better in math reasoning in English compared to in original languages (+10%), indicating persistent challenges with culturally grounded knowledge.

Cite This Study

Fabbri et al. studied this question.

synapsesocial.com/papers/68e6679587ecc93a24d17761 https://doi.org/10.48550/arxiv.2507.17476
