May 27, 2025

Generative Content Analysis for Policy Research: Comparing LLM Reliability in Analyzing Institutional AI Discourse

Abstract

This study examines the methodological implications of using large language models (LLMs) as research assistants in coding and qualitative content analysis. We compared how ChatGPT-4o and Gemini 2.0 perform when independently coding and extracting content from university generative AI policies according to a framework of ten "vocabularies of AI competence." Our dataset comprised official AI guidelines from 33 leading global universities. Quantitative analysis of inter-coder reliability indicated significant variation across conceptual categories, with high convergence for vocabularies related to academic integrity and information accuracy, but divergence in detecting other concepts such as AI dependency. Qualitative comparisons of extraction outputs demonstrated methodological trade-offs between models: ChatGPT-4o provided fewer but contextually richer extractions, whereas Gemini 2.0 provided more numerous but briefer quotations. These findings raise three considerations for researchers employing LLMs in qualitative analysis: assessing reliability separately for each conceptual domain, combining multiple models to balance analytical depth and breadth, and acknowledging that the composition of the resulting dataset depends on the model used.
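As an illustration of what a per-category inter-coder reliability check between two models could look like, here is a minimal Python sketch. The abstract does not name the agreement statistic the authors used, so Cohen's kappa is an assumption here, as is the binary present/absent coding scheme. The function name `cohen_kappa`, the category keys, and all code values are hypothetical and invented for illustration; they are not data from the study.

```python
from typing import Dict, List, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two coders' binary (0/1) codes over the same documents."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both coders must rate the same non-empty set of documents")
    n = len(a)
    # Share of documents on which the two coders gave the same code.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each coder's marginal rate of 1s.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both coders gave one identical constant code; kappa is undefined,
        # and this sketch treats the case as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes (1 = vocabulary detected in the policy, 0 = not detected)
# for eight made-up documents. These values are NOT taken from the study.
model_a: Dict[str, List[int]] = {
    "academic_integrity": [1, 1, 0, 1, 0, 1, 1, 0],
    "ai_dependency": [1, 0, 0, 1, 0, 0, 1, 0],
}
model_b: Dict[str, List[int]] = {
    "academic_integrity": [1, 1, 0, 1, 0, 1, 1, 0],
    "ai_dependency": [0, 0, 1, 1, 0, 1, 0, 0],
}

# One kappa per conceptual category, so reliability is assessed per domain.
kappas = {cat: cohen_kappa(model_a[cat], model_b[cat]) for cat in model_a}

for cat, k in kappas.items():
    print(f"{cat}: kappa = {k:.3f}")
```

Computing the statistic separately for each category, rather than pooling all codes, is what makes category-level divergence of the kind the abstract describes visible.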

Cite This Study

Rughiniş et al. studied this question.

synapsesocial.com/papers/6a08a9c4113ba5b476de5f90 https://doi.org/10.1109/cscs66924.2025.00094
