This study addresses a critical gap in existing research by systematically comparing the performance of five popular large language models (LLMs) in supporting high-quality qualitative research. Our methodology combines a literature review of academic papers from 2020 to 2025 with a proof-of-concept experiment evaluating ScholarAI, ChatGPT-4o, Claude 3.5 Sonnet, NotebookLM and Perplexity on key qualitative analysis tasks. We sought to determine how well these generative artificial intelligence (AI) models meet established standards of methodological rigor in qualitative analysis. Findings reveal significant variation in LLM performance: the models excelled at efficiently retrieving relevant literature, summarizing content and generating insights, but exhibited inconsistencies in contextual comprehension, coding accuracy and depth of critical analysis. These results informed a novel evaluation framework aligning LLM outputs with qualitative research quality criteria, contributing guidance for researchers and practitioners. We recommend that practitioners leverage LLMs to improve productivity while exercising critical oversight of their outputs, and that researchers address ethical concerns and refine evaluation rubrics to ensure responsible AI integration. Overall, this work establishes a foundation for responsible human–AI collaboration in qualitative research by highlighting both the opportunities and challenges of using generative AI to enhance methodological rigor and accessibility.
Proença et al. studied this question.