Context: Agile Software Development (ASD) and reuse strategies are increasingly used to improve software productivity and maintainability. However, while reuse relies on structured and traceable artifacts, ASD often depends on informal elements such as user stories, which limits opportunities for systematic reuse. A recent taxonomy proposes classifying user stories to support traceability and asset reuse, but manual classification remains labor-intensive and error-prone.

Objective: This study investigates whether Large Language Models (LLMs) can automate the classification of user stories using a reuse-oriented taxonomy, reducing manual effort while preserving annotation quality.

Method: We adopted an explanatory sequential mixed-methods approach. First, a two-step prompting protocol was applied to classify user stories from 12 real-world projects using GPT-4-turbo. Then, we compared model outputs to expert annotations, measuring agreement and qualitatively analyzing disagreements to identify causes and propose corrective actions.

Results: The LLM achieved a 48.1% agreement rate with human labels, with project-specific performance ranging from 14.0% to 84.4%. Notably, in 46% of disagreement cases the LLM’s classification was judged more appropriate than the human label, and in only 25% was the human label judged correct, which highlights inconsistencies in the human annotation process despite prior validation.

Conclusion: These initial findings suggest that LLMs can effectively assist in classifying user stories for reuse purposes. Beyond reducing labeling effort, they show potential as reviewers in collaborative workflows to improve consistency, transparency, and the overall quality of software artifact organization.
Souza et al. studied this question.
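The agreement measure described in the abstract (an overall rate plus a per-project range) can be illustrated with a minimal sketch. It assumes agreement is the simple fraction of stories where the LLM label equals the expert label; the study's exact computation is not given here. The function name `agreement_rates`, the project names, and the label values are hypothetical placeholders and are not taken from the taxonomy or the dataset.

```python
from collections import defaultdict


def agreement_rates(records):
    """Return (overall_rate, per_project_rates) for labeled stories.

    `records` is an iterable of (project, human_label, llm_label) tuples.
    Agreement is assumed to mean exact label equality.
    """
    # project -> [number of matching labels, number of stories]
    counts = defaultdict(lambda: [0, 0])
    for project, human_label, llm_label in records:
        counts[project][0] += int(human_label == llm_label)
        counts[project][1] += 1

    matches = sum(m for m, _ in counts.values())
    total = sum(t for _, t in counts.values())
    overall = matches / total if total else 0.0
    per_project = {p: m / t for p, (m, t) in counts.items()}
    return overall, per_project


# Hypothetical toy data: labels and project names are illustrative only.
records = [
    ("proj_a", "label_1", "label_1"),
    ("proj_a", "label_2", "label_1"),
    ("proj_b", "label_1", "label_1"),
    ("proj_b", "label_3", "label_3"),
]

overall, per_project = agreement_rates(records)
print(f"overall agreement: {overall:.1%}")
for project, rate in sorted(per_project.items()):
    print(f"{project}: {rate:.1%}")
```

Raw agreement of this kind does not correct for chance; a chance-corrected statistic such as Cohen's kappa could be reported alongside it, though the abstract does not say whether the study did so.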