October 11, 2025

Towards Automating User Story Classification with Large Language Models Using a Reuse-Oriented Taxonomy

Key Points

The LLM achieved a 48.1% agreement rate with human labels, indicating potential for automation in user story classification.
Agreement varied widely across the 12 real-world projects, ranging from 14.0% to 84.4%.
A qualitative analysis of the disagreements judged the LLM's classification more appropriate than the human label in 46% of cases, and the human label correct in only 25%.
The findings indicate that LLMs may enhance consistency and quality in software artifact organization, addressing challenges in manual classification.

Abstract

Context: Agile Software Development (ASD) and reuse strategies are increasingly used to improve software productivity and maintainability. However, while reuse relies on structured and traceable artifacts, ASD often depends on informal elements such as user stories, limiting opportunities for systematic reuse. A recent taxonomy proposes classifying user stories to support traceability and asset reuse, but manual classification remains labor-intensive and error-prone.

Objective: This study investigates whether Large Language Models (LLMs) can automate the classification of user stories using a reuse-oriented taxonomy, reducing manual effort while preserving annotation quality.

Method: We adopted an explanatory sequential mixed-methods approach. First, a two-step prompting protocol was applied to classify user stories from 12 real-world projects using GPT-4-turbo. Then, we compared model outputs to expert annotations, measuring agreement and qualitatively analyzing disagreements to identify causes and propose corrective actions.

Results: The LLM achieved a 48.1% agreement rate with human labels, with project-specific performance ranging from 14.0% to 84.4%. Notably, in 46% of disagreement cases the LLM’s classification was judged more appropriate than the human label, and in only 25% was the human label judged correct, highlighting inconsistencies in the human annotation process despite prior validation.

Conclusion: These initial findings suggest that LLMs can effectively assist in classifying user stories for reuse purposes. Beyond reducing labeling effort, they could serve as reviewers in collaborative workflows to improve consistency, transparency, and the overall quality of software artifact organization.
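The agreement figures reported above can be illustrated with a minimal sketch. It assumes agreement means an exact match between the LLM label and the human label, and that the overall rate is pooled over all stories rather than averaged across projects; the abstract does not state either detail. The function name, record layout, project names, and category labels are all hypothetical.

```python
from collections import defaultdict


def agreement_rates(records):
    """Compute per-project and pooled agreement between two label sets.

    `records` is an iterable of (project, human_label, llm_label) tuples.
    Agreement is assumed to be an exact label match (an assumption,
    not a detail confirmed by the paper).
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for project, human_label, llm_label in records:
        totals[project] += 1
        if human_label == llm_label:
            matches[project] += 1

    if not totals:
        raise ValueError("no records to compare")

    # Fraction of matching labels within each project.
    per_project = {p: matches[p] / totals[p] for p in totals}
    # Pooled rate over all stories, so larger projects weigh more.
    overall = sum(matches.values()) / sum(totals.values())
    return per_project, overall


# Toy data with hypothetical project names and category labels.
records = [
    ("project_a", "category_1", "category_1"),
    ("project_a", "category_2", "category_1"),
    ("project_b", "category_1", "category_1"),
    ("project_b", "category_3", "category_3"),
    ("project_b", "category_2", "category_2"),
]

per_project, overall = agreement_rates(records)
print(per_project)
print(overall)
```

A pooled rate and a per-project mean can differ noticeably when project sizes are uneven, which is one reason to report the per-project range alongside the overall figure.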
