What question did this study set out to answer?

The study aims to assess how well five large language models perform gradient acceptability judgments on German sentences.

March 18, 2026 | Open Access

Norms, Contexts and Patterns of Variation: Evaluating Acceptability Judgments of Five LLMs Across Linguistic Dimensions in German

Key Points

Pilot study evaluating five large language models
Gradient acceptability judgment tasks with 150 sentences
Sentences rated on a 5-point Likert scale across five categories
Mixed-effects analyses to explore variability across models
Models distinguish well between grammatical and ungrammatical stimuli
Variation in model responses depends on stimulus categories
Models show less uniformity in context-sensitive aspects of language

Abstract

This paper reports on a pilot study evaluating five large language models (ChatGPT-4, Gemini 2.0 Flash, Claude 3.5 Sonnet, Perplexity AI, and DeepSeek) in gradient acceptability judgment tasks in German. The models rated 150 contextually embedded sentences on a 5-point Likert scale across five categories: gray-zone (variable) items, canonical grammatical items, ungrammatical items, diatopically marked items, and diastratically/diaphasically marked items. All models reliably distinguish between clearly grammatical and clearly ungrammatical stimuli in unambiguous morphosyntactic contexts. Mixed-effects analyses further show that differences between models vary across stimulus categories rather than reflecting a uniform global shift in acceptability ratings. These findings indicate that current LLMs robustly capture core morphosyntactic contrasts, but that model behavior is less uniform in domains involving variation and contextual sensitivity. The study contributes to the empirical assessment of LLMs as acceptability raters and informs debates on their methodological role in linguistics.
