This paper reports on a pilot study evaluating five large language models (ChatGPT-4, Gemini 2.0 Flash, Claude 3.5 Sonnet, Perplexity AI, and DeepSeek) on gradient acceptability judgment tasks in German. The models rated 150 contextually embedded sentences on a 5-point Likert scale across five categories: gray-zone (variable) items, canonical grammatical items, ungrammatical items, diatopically marked items, and diastratically/diaphasically marked items. All models reliably distinguish between clearly grammatical and clearly ungrammatical stimuli in unambiguous morphosyntactic contexts. Mixed-effects analyses further show that differences between models vary across stimulus categories rather than reflecting a uniform global shift in acceptability ratings. These findings indicate that current LLMs robustly capture core morphosyntactic contrasts, but that model behavior is less uniform in domains involving variation and contextual sensitivity. The study contributes to the empirical assessment of LLMs as acceptability raters and informs debates on their methodological role in linguistics.
Catasso et al. studied this question.