Amid educational debates about generative artificial intelligence (GenAI), little research focuses on large language models’ (LLMs) reliability, which has important implications regardless of whether students are permitted to use LLMs in sociology classrooms. In this exploratory study, we focus on the intersection of GenAI and teaching and learning in sociology, asking: To what extent are LLM services, including ChatGPT, DeepSeek, and Gemini, reliable (a) with one another and (b) over (a short) time? We administered a sociology quiz with 20 multiple-choice questions of varying difficulty—and covering different topics, some of which are sensitive—to each of these LLM services over the course of two seven-day intervals: April 21 through 27, 2025, and June 13 through 19, 2025. The results indicate very high levels of reliability between LLM services and over these time intervals, but some unreliability mainly on two questions that were sensitive, required higher-level thinking, or both. Pedagogical and other implications are discussed.
Dixon et al. (Wed,) studied this question.