What does this research mean for the field?

Large language models (LLMs) showed very high levels of reliability across different services and over short time intervals, with exceptions mainly on questions that were sensitive, required higher-level thinking, or both. Novelty: confirmatory. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to evaluate the reliability of large language models (LLMs) in providing consistent answers across different services and over a short time frame.

February 26, 2026

Are Large Language Models Reliable across Services and over (a Short) Time? An Exploratory Study in Sociology with Pedagogical Implications

Key Points

The study evaluates whether LLM services give consistent answers (a) compared with one another and (b) over a short time frame.
The authors administered a sociology quiz with 20 multiple-choice questions to LLM services including ChatGPT, DeepSeek, and Gemini.
The quiz was run over two seven-day intervals: April 21 through 27, 2025, and June 13 through 19, 2025.
The questions varied in difficulty and covered different topics, some of them sensitive.
Reliability was very high both between LLM services and over these time intervals.
The unreliability that did appear was concentrated mainly on two questions that were sensitive, required higher-level thinking, or both.

Abstract

Amid educational debates about generative artificial intelligence (GenAI), little research focuses on large language models’ (LLMs) reliability, which has important implications regardless of whether students are permitted to use LLMs in sociology classrooms. In this exploratory study, we focus on the intersection of GenAI and teaching and learning in sociology, asking: To what extent are LLM services, including ChatGPT, DeepSeek, and Gemini, reliable (a) with one another and (b) over (a short) time? We administered a sociology quiz with 20 multiple-choice questions of varying difficulty—and covering different topics, some of which are sensitive—to each of these LLM services over the course of two seven-day intervals: April 21 through 27, 2025, and June 13 through 19, 2025. The results indicate very high levels of reliability between LLM services and over these time intervals, but some unreliability mainly on two questions that were sensitive, required higher-level thinking, or both. Pedagogical and other implications are discussed.
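
To make the two kinds of reliability concrete, here is a minimal sketch of how agreement between answer sets could be computed. The abstract does not state which reliability statistic the authors used, so simple percent agreement is an assumption here. The function names, the service labels, and the five-question answer lists are all hypothetical and are not data from the study.

```python
from itertools import combinations


def percent_agreement(a, b):
    """Share of questions on which two answer sequences match."""
    if len(a) != len(b):
        raise ValueError("answer sequences must have the same length")
    if not a:
        raise ValueError("answer sequences must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(answers):
    """Percent agreement for every pair of labels in a {label: answers} mapping."""
    return {
        (s1, s2): percent_agreement(answers[s1], answers[s2])
        for s1, s2 in combinations(sorted(answers), 2)
    }


# Made-up answers to a 5-question quiz (illustrative only, not study data).
toy_run = {
    "service_a": ["A", "C", "B", "D", "A"],
    "service_b": ["A", "C", "B", "D", "B"],
    "service_c": ["A", "C", "D", "D", "A"],
}

# (a) Reliability between services: compare services on the same day.
between_services = pairwise_agreement(toy_run)

# (b) Reliability over time: compare one service with itself on two days.
service_a_later = ["A", "C", "B", "D", "A"]  # hypothetical second run
over_time = percent_agreement(toy_run["service_a"], service_a_later)
```

Percent agreement is easy to read but does not correct for matches that would occur by chance on multiple-choice items; a chance-corrected statistic would be a natural alternative if one wanted a stricter measure.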
