What question did this study set out to answer?

This review aims to explore the application and reporting quality of large language models in qualitative research.

June 17, 2026 · Open Access

The use and methodological reporting of large language models in qualitative research: a scoping review

Key Points

Conducted a scoping review following the PRISMA-ScR guidelines and the Joanna Briggs Institute framework.
Searched five databases for empirical studies using LLMs in qualitative research, published between January 2020 and May 2025.
Screened 4,201 records after duplicate removal and extracted data on methodological characteristics and LLM implementation details.
Included 75 studies, 93% of which used OpenAI GPT models.
Coding assistance (n = 43) and theme identification (n = 41) were the most common LLM applications.
Only 13 studies reported temperature settings, and 45% did not specify the deployment configuration.
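
To make the reporting gaps above concrete, here is a minimal sketch of how one study's LLM settings could be recorded and checked for unreported items. The class name, field names, and the `missing_items` helper are illustrative assumptions, not a template taken from the review; the fields simply mirror the items the review counts (deployment, temperature, top-p, context length, prompts).

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class LLMReportingRecord:
    """Hypothetical per-study record; None means the item was not reported."""
    model: str
    deployment: Optional[str] = None       # e.g. "API", "web interface", "local"
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    context_length: Optional[int] = None
    prompt_reported: Optional[str] = None  # e.g. "complete", "partial", "none"


def missing_items(record: LLMReportingRecord) -> List[str]:
    """Return the names of the fields that were left unreported."""
    return [name for name, value in asdict(record).items() if value is None]


# Illustrative record for a made-up study (not data from the review).
example = LLMReportingRecord(model="example-model", deployment="API", temperature=0.0)
print(missing_items(example))
```

A record like this can be filled in during data extraction, so that unreported parameters are flagged explicitly rather than silently omitted.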

Abstract

Background: Large language models (LLMs) are being integrated into qualitative research processes, yet the scope, function, and reporting quality of their use remain poorly understood. Existing reporting guidelines for qualitative research, such as the Consolidated Criteria for Reporting Qualitative Research (COREQ), provide minimal guidance for documenting LLM use. This scoping review provides an overview of emerging LLM applications in qualitative research and assesses the associated reporting practices.

Methods: A scoping review was conducted following the PRISMA-ScR guidelines and the Joanna Briggs Institute methodological framework. Five databases (PubMed, CINAHL, PsycINFO, Business Source Premier, and Scopus) were searched for peer-reviewed empirical studies published between January 2020 and May 2025 that employed at least one LLM in a substantive qualitative research stage. The search yielded 5,049 records, of which 4,201 remained after duplicate removal. Studies were screened independently by multiple reviewers, and data were extracted using a standardized template capturing study metadata, methodological characteristics, and comprehensive LLM implementation details.

Results: Seventy-five studies were included. OpenAI GPT models dominated the field, appearing in 93% of studies. LLMs were applied across the full spectrum of qualitative research, with coding assistance (n = 43) and theme identification (n = 41) as the most common applications. Thematic analysis was the predominant qualitative method (n = 38); content analysis was used in 12 studies. Technical reporting was highly inconsistent: only 13 studies reported temperature settings, 12 documented context length, and 4 provided top-p values. Nearly half of the studies (45%, n = 34) did not specify the deployment configuration (API, web interface, or local), and 75% (n = 56) reported no parameter settings at all. While 61% of studies provided complete or partial prompts, 13% reported no prompting details. Agreement rates between LLM and human coders ranged from 36% to 99%, reflecting substantial variation related to task complexity, prompt engineering quality, and validation rigor. Nearly all studies (95%) discussed ethical considerations, and 97% incorporated human verification of AI outputs.

Discussion: LLMs have been adopted across qualitative research workflows, yet critical methodological details are frequently underreported, undermining comparability. The findings highlight an urgent need for dedicated reporting guidelines, such as the COREQ + LLM extension, to ensure that LLM-assisted qualitative research meets standards of transparency, rigor, and interpretive depth. Future research should address the predominance of proprietary models, the limited evidence for non-English contexts, and the need for systematic comparison of models, prompting strategies, and validation approaches.
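
The abstract does not state which agreement metric the included studies used, so the following is only a sketch of two common choices: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The function names and the toy code labels are assumptions for illustration, not data from the review.

```python
import math
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of segments to which both coders assigned the same code."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both coders must label the same non-empty set of segments.")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two coders."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each coder's marginal proportion per code.
    p_e = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if p_e == 1.0:
        return math.nan  # undefined when both coders use one identical code
    return (p_o - p_e) / (1.0 - p_e)


# Toy example with made-up labels for eight text segments.
human = ["barrier", "barrier", "facilitator", "barrier",
         "facilitator", "facilitator", "barrier", "barrier"]
llm = ["barrier", "facilitator", "facilitator", "barrier",
       "facilitator", "barrier", "barrier", "barrier"]

print(f"percent agreement: {percent_agreement(human, llm):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(human, llm):.2f}")
```

In this toy case the coders agree on 6 of 8 segments (75%), yet kappa is only about 0.47, which illustrates why agreement figures are hard to compare across studies unless the metric is reported.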
