May 1, 2026 | Open Access

Deep Active Learning for Label-Efficient Refactoring Prediction

Key Points

The study aims to develop a Deep Active Learning (DAL) pipeline that improves refactoring prediction when labeling resources are limited.
The pipeline iteratively trains a deep neural classifier on software-metric representations and queries labels for the most informative unlabeled entities.
It is evaluated in a pool-based setting on class-, method-, and variable-level refactoring datasets covering multiple refactoring types.
All experiments use a consistent training protocol and a broad set of query strategies.
DAL reaches near full-data effectiveness with far fewer labels: on average 11.4% of the data for class-level, 25.0% for method-level, and 20.0% for variable-level refactorings.
This corresponds to roughly 75–89% savings in labeling effort.
Uncertainty-based and dropout-enhanced query strategies were the most consistently effective across refactoring types and labeling budgets.
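
The reported savings range can be checked against the reported label fractions. The short sketch below assumes that "labeling savings" means one minus the fraction of the pool that had to be labeled; the function name `labeling_savings` is illustrative and does not come from the paper.

```python
# Fraction of the pool that had to be labeled, as reported above.
labeled_fraction = {"class": 0.114, "method": 0.250, "variable": 0.200}


def labeling_savings(fraction_labeled):
    """Share of annotation effort avoided relative to labeling the full pool.

    Assumption: savings = 1 - fraction labeled (not stated explicitly in the source).
    """
    return 1.0 - fraction_labeled


savings = {level: labeling_savings(f) for level, f in labeled_fraction.items()}
for level, s in savings.items():
    print(f"{level}: {s:.1%} saved")
```

Under this assumption the three levels give 88.6%, 75.0%, and 80.0%, which is consistent with the stated range of roughly 75–89%.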

Abstract

Software refactoring improves the maintainability of code and reduces technical debt, but constructing a labeled refactoring dataset is costly and labor-intensive. To make refactoring prediction more deployable under limited annotation budgets, this paper introduces a Deep Active Learning (DAL) pipeline that iteratively trains a deep neural classifier on software-metric representations and selectively queries labels for the most informative unlabeled entities. Our proposed approach is evaluated in a pool-based setting across class-, method-, and variable-level refactoring datasets (multiple refactoring types) using a consistent training protocol and a broad set of query strategies. Results show that DAL can recover near full-data effectiveness with substantially fewer labels: on average, reaching the target performance requires 11.4% labeled data for class-level, 25.0% for method-level, and 20.0% for variable-level refactorings. This corresponds to roughly 75–89% labeling savings and demonstrates improved data efficiency for refactoring prediction. Moreover, uncertainty-based and dropout-enhanced strategies were the most consistently effective query strategies across refactoring types and labeling budgets.
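
To make the pool-based setting concrete, the following is a minimal sketch of one uncertainty-based query loop. It is not the authors' pipeline: a small logistic-regression model stands in for the deep neural classifier, the two-feature pool and its labels are synthetic, the query strategy is plain least-confidence sampling (dropout-based strategies are not shown), and all names (`train`, `active_learning`, `oracle`) are hypothetical.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of the positive class (e.g. 'entity will be refactored')."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=200, lr=0.5):
    """Stand-in classifier: logistic regression via batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def active_learning(pool, oracle, seed_ids, batch_size, rounds):
    """Pool-based loop: train, score the unlabeled pool, query the least certain."""
    labels = {i: oracle(i) for i in seed_ids}
    for _ in range(rounds):
        ids = list(labels)
        w, b = train([pool[i] for i in ids], [labels[i] for i in ids])
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        # Least-confidence sampling: probabilities nearest 0.5 are most uncertain.
        unlabeled.sort(key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))
        for i in unlabeled[:batch_size]:
            labels[i] = oracle(i)  # simulated annotator
    ids = list(labels)
    model = train([pool[i] for i in ids], [labels[i] for i in ids])
    return model, labels


# Synthetic pool of "software-metric" vectors with a hidden linear ground truth.
rng = random.Random(0)
pool = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
truth = [1 if x[0] + x[1] > 0 else 0 for x in pool]
model, labels = active_learning(
    pool, lambda i: truth[i], range(10), batch_size=5, rounds=6
)
print(f"labeled {len(labels)} of {len(pool)} entities ({len(labels) / len(pool):.0%})")
```

With a seed set of 10 entities and 6 rounds of 5 queries each, the loop labels 40 of the 200 pool entries (20%). In a real study the oracle would be a human annotator or held-back ground truth, and the stopping point would be the budget at which the model reaches the target performance.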
