September 28, 2025

B – 65 Manual to Machine: Evaluating the Reliability of Artificial Intelligence for Systematic Review Data Extraction

Key Points

Overall agreement between extractors was minimal, with a Cohen's Kappa of 0.38 indicating limited reliability.
Agreement was greater for metadata extraction (κ = 0.47) than for outcome data (κ = 0.37), so reliability varied by data type.
Data extraction was performed in parallel by both a human extractor and Microsoft Copilot, limiting potential bias in evaluation.
The findings suggest that AI use for data extraction should be limited to simpler data types, where agreement was higher.

Abstract

Objective: In an ongoing rapid systematic review of the use of behavior change techniques in cognitive rehabilitation interventions, Microsoft Copilot, a generative AI resource, will be utilized to support data extraction. The objective of this study is to examine the reliability of data extraction by Copilot vs. a human extractor.

Method: We searched the PubMed, Scopus, WoS, and Embase databases (last updated 01/28/25) using the search terms “stroke” AND “cognition disorders” AND “cognitive training/behavior change therapy” AND “single component/multicomponent intervention” with other variations. Inclusion criteria were: randomized controlled trials of cognitive rehabilitation interventions applied to adults with stroke, published in English between 2004 and 2024, with functional or cognitive outcome measures. Fifteen studies were reviewed. Data extraction was performed in parallel by one human extractor and Copilot using the same data extraction template. Data extracted included metadata (author, location, etc.) and study outcomes. Interrater reliability was measured by assessing the agreement of extracted data between extractors with Cohen’s Kappa.

Results: Overall agreement between extractors was minimal (κ = 0.38; range: 0.29-0.46). Agreement was greater for metadata extraction (κ = 0.47; range: 0.38-1.0), which ranged from minimal to almost perfect, than for outcome data (κ = 0.37; range: 0.27-0.44), which ranged from minimal to weak.

Conclusion: The AI tool, Microsoft Copilot, was not consistently reliable for data extraction. When an AI tool is used for data extraction in a systematic review, it should be restricted to the extraction of data with minimal complexity.
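For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen's Kappa compares observed agreement between two extractors with the agreement expected by chance. The abstract does not describe how extracted values were coded for comparison, so the categorical labels, the function name cohens_kappa, and the example data below are all hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of each rater's
    # marginal proportion for that label.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codes for 10 extracted fields (NOT data from the study).
human = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"]
copilot = ["yes", "no", "no", "yes", "yes", "no", "no", "no", "yes", "yes"]

kappa = cohens_kappa(human, copilot)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two extractors agree on 6 of 10 fields, but because each uses "yes" and "no" equally often, half of that agreement is expected by chance, which is why kappa is much lower than the raw agreement rate.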

Cite this study

Grant et al. studied this question.

synapsesocial.com/papers/68d9051b41e1c178a14f4ce8 — DOI: https://doi.org/10.1093/arclin/acaf084.215

Authors

Adrian Grant

Social Policy Research Associates (United States)

Fedora Biney

University of Alabama at Birmingham Hospital

Helen Bliss

Journals

Archives of Clinical Neuropsychology
