What question did this study set out to answer?

It set out to evaluate how effectively large language models detect rhetorical moves in academic abstracts, compared with human analysis.

March 21, 2026

From manual to automated: applying large language models to rhetorical move analysis in journal abstracts

Key Points

The study evaluates how well large language models (LLMs) detect rhetorical moves in academic abstracts compared with human analysis.
It used an established data set and the BAMRC (Background, Aim, Method, Results, Conclusion) framework from a previous study of social science abstracts.
Four contemporary LLMs were compared: OpenAI GPT-4o Mini, DeepSeek Chat, Claude 3.5 Haiku, and Gemini 2.0 Flash.
Identical prompts were used across models to ensure comparability.
LLMs achieved 51.7%–57.9% Jaccard similarity with human annotations at the abstract level.
Agreement among the LLMs themselves was substantially higher, at 77.0%–86.7%.
Compared with human annotators, LLMs generally produced complete BAMRC structures more often and used the Undefined category more frequently at the sentence level (11.6%–29.2% vs 4.8%).

Abstract

Purpose: This study aims to compare large language models (LLMs) with human analysis for rhetorical move detection in journal article abstracts, examining whether automated discourse analysis can complement human efforts in bibliographic metadata creation and information retrieval.

Design/methodology/approach: Using an established data set and BAMRC (Background, Aim, Method, Results, Conclusion) framework from a previous study of social science abstracts, four contemporary LLMs (OpenAI GPT-4o Mini, DeepSeek Chat, Claude 3.5 Haiku and Gemini 2.0 Flash) were compared against the original human analysis. Identical prompts were used across models to ensure comparability.

Findings: LLMs showed modest agreement with human annotations at the abstract level (51.7%–57.9% Jaccard similarity) but substantially higher intermodel agreement (77.0%–86.7%). This pattern indicates convergence toward a distinct LLM-influenced annotation style that diverges systematically from human judgment, including generally higher rates of complete BAMRC structures and more frequent use of the Undefined category at the sentence level (11.6%–29.2% vs 4.8% for humans). Agreement declined sharply at the sentence level, averaging approximately 19%.

Practical implications: LLM-based rhetorical analysis can provide scalable insights into abstract structure and academic discourse patterns, informing quality assessment and indexing practices in digital libraries. API-driven workflows enable low-cost, large-scale metadata analysis as a complement to human expertise, without presuming full replacement of manual processes.

Originality/value: This study offers a multi-LLM comparison on a decade-old, human-annotated social science abstract data set, highlighting both the potential and the limitations of transferring LLM capabilities to rhetorical move analysis and related discourse tasks.
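To make the abstract-level agreement figures concrete, the sketch below shows one plausible way to compute Jaccard similarity between two annotators' sets of BAMRC moves and average it over abstracts. The abstract does not say exactly how the study aggregated its scores, so the per-abstract set representation, the function names (`jaccard`, `mean_jaccard`) and the toy annotations are all illustrative assumptions, not the authors' procedure or data.

```python
def jaccard(moves_a, moves_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of move labels."""
    a, b = set(moves_a), set(moves_b)
    if not a and not b:
        # Convention assumed here: two empty annotations count as full agreement.
        return 1.0
    return len(a & b) / len(a | b)


def mean_jaccard(annotations_a, annotations_b):
    """Average Jaccard similarity over the abstracts both annotators labeled."""
    shared_ids = sorted(annotations_a.keys() & annotations_b.keys())
    if not shared_ids:
        raise ValueError("the two annotators share no abstracts")
    scores = [jaccard(annotations_a[i], annotations_b[i]) for i in shared_ids]
    return sum(scores) / len(scores)


# Hypothetical toy annotations (abstract id -> set of moves found in it).
human = {
    "abs1": {"Aim", "Method", "Results"},
    "abs2": {"Background", "Aim"},
}
model_a = {
    "abs1": {"Background", "Aim", "Method", "Results", "Conclusion"},
    "abs2": {"Background", "Aim", "Method"},
}
model_b = {
    "abs1": {"Background", "Aim", "Method", "Results", "Conclusion"},
    "abs2": {"Aim", "Method"},
}

human_vs_model = mean_jaccard(human, model_a)    # human-to-LLM agreement
model_vs_model = mean_jaccard(model_a, model_b)  # intermodel agreement
print(f"human vs model_a:   {human_vs_model:.3f}")
print(f"model_a vs model_b: {model_vs_model:.3f}")
```

In this toy data the two models agree with each other more closely than either agrees with the human annotator, which is the qualitative pattern the findings describe. The numbers themselves are invented for illustration.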
