This study examines the methodological implications of using large language models (LLMs) as research assistants in coding and qualitative content analysis. We compared how ChatGPT-4o and Gemini 2.0 perform when independently coding and extracting content from university generative AI policies according to a framework of ten "vocabularies of AI competence." Our dataset comprised official AI guidelines from 33 leading global universities. Quantitative analysis of inter-coder reliability indicated significant variation across conceptual categories, with high convergence for vocabularies related to academic integrity and information accuracy, but divergence in detecting other concepts such as AI dependency. Qualitative comparisons of extraction outputs demonstrated methodological trade-offs between the models: ChatGPT-4o provided fewer but contextually richer extractions, whereas Gemini 2.0 provided more numerous but briefer quotations. These findings raise three considerations for researchers employing LLMs in qualitative analysis: domain-specific reliability assessment, complementary multi-model approaches to balance analytical depth and breadth, and acknowledgment of model-dependent dataset composition.
Rughiniş et al. (Tue,) studied this question.
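
The abstract reports inter-coder reliability per conceptual category but does not say which statistic was used. As an illustration only, the sketch below assumes Cohen's kappa computed on binary presence/absence codes, one code per policy document and vocabulary. The function name `cohen_kappa`, the category keys, and all code values are hypothetical and invented for this example; they are not the study's data or results.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labelled the same items.

    Assumption: the paper's reliability statistic is not named in the
    abstract; kappa is used here purely as a common choice.
    """
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(codes_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is
        # undefined, so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy codes (1 = vocabulary detected in the document, 0 = not).
# Each list position is one document; values are invented for illustration.
toy_codes = {
    "academic_integrity": {
        "model_a": [1, 1, 1, 0, 1, 1, 0, 1],
        "model_b": [1, 1, 1, 0, 1, 1, 0, 1],
    },
    "ai_dependency": {
        "model_a": [1, 0, 1, 0, 1, 0, 0, 0],
        "model_b": [0, 0, 1, 1, 0, 0, 1, 0],
    },
}

# One kappa per vocabulary, mirroring a per-category reliability assessment.
kappa_by_category = {
    category: cohen_kappa(codes["model_a"], codes["model_b"])
    for category, codes in toy_codes.items()
}

for category, kappa in kappa_by_category.items():
    print(f"{category}: kappa = {kappa:.3f}")
```

Computing the statistic separately for each vocabulary, rather than once over all codes pooled together, is what makes category-level divergence visible: a pooled figure could look acceptable while one concept is coded close to chance.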