What question did this study set out to answer?

The aim is to identify reusable code clones by their characteristics, focusing on quality aspects.

January 25, 2026

An Empirical Study on the Characteristics of Reusable Code Clones

Key Points

The aim is to identify reusable code clones by their characteristics, focusing on quality aspects.
Analyzed 60 open-source projects (30 Java and 30 C), collectively encompassing 538,598 commits.
Utilized machine learning classifiers and large language models.
Assessed clones based on prevalence, fault-resiliency, and longevity.
Achieved a median AUC of 0.73 and an F1-score of 0.89.
CountFollowers, SimilarityClonePaths, and CountContributors were key factors in classification.
Identifying high-quality clones can help reduce maintenance costs and improve software quality.
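
The abstract describes SimilarityClonePaths as the Jaccard similarity coefficient among clone paths. The following is a minimal sketch of one way such a feature could be computed, not the study's implementation. It assumes that each path is tokenized into its directory and file components and that a clone group's score is the mean over all pairs of paths; both choices, and all function names, are illustrative.

```python
from itertools import combinations
from pathlib import PurePosixPath


def path_tokens(path: str) -> frozenset[str]:
    """Split a file path into the set of its components (assumed tokenization)."""
    return frozenset(PurePosixPath(path).parts)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard coefficient |A & B| / |A | B|; defined here as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_clone_paths(paths: list[str]) -> float:
    """Mean pairwise Jaccard similarity over a clone group's paths (assumed aggregation)."""
    pairs = list(combinations(paths, 2))
    if not pairs:
        return 1.0  # fewer than two paths: nothing to compare
    total = sum(jaccard(path_tokens(p), path_tokens(q)) for p, q in pairs)
    return total / len(pairs)
```

Under these assumptions, two clones in the same directory score higher than clones scattered across unrelated directories, which is the intuition a path-similarity feature is meant to capture.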

Abstract

Copy-and-paste of code is a common practice in software development, especially given the myriad open-source projects available. Reusing existing code fragments can accelerate development and improve software quality, provided that the reused fragments are themselves of high quality. Code cloning therefore cannot always be avoided, even though it can increase software maintenance cost: bugs can be propagated inadvertently through cloning and spread beyond the boundary of a single system. Generally speaking, reusable code clones are highly popular, have fewer bugs, and stand the test of time, persisting in the system over the long term. Current code reuse studies primarily leverage API usage and method clone structures to provide coding recommendations. However, existing approaches focus on the functional utility of code snippets and do not consider quality aspects, such as fault resiliency, when analyzing code reuse. Without clone quality in mind, the virtues of code cloning are severely diminished. To help developers determine whether code clones are reusable, we leverage Machine Learning classifiers and Large Language Models (LLMs) to automatically identify reusable code clones from three perspectives: clone prevalence (i.e., the number of clone siblings), clone fault-resiliency (i.e., the percentage of non-buggy commits versus buggy commits), and clone longevity (i.e., the clone genealogy length). Our approach achieves a median AUC of 0.73 and an F1-score of 0.89 in experiments on 60 open-source projects (30 Java and 30 C) that collectively encompass 538,598 commits.
The results show that CountFollowers (i.e., the number of people following the contributors who wrote the clones), SimilarityClonePaths (i.e., the Jaccard similarity coefficient among clone paths), and CountContributors (i.e., the number of distinct contributors who access the clone group) provide the most explanatory power for correctly classifying reusable code clones. Hence, practitioners can use our classifiers and the insights from our findings to make more reliable use of code clones and to prioritize high-quality clones in their clone management activities.
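
To make the three perspectives concrete, the sketch below computes each one for a single clone group and compares it against a threshold. It is an illustration under stated assumptions rather than the study's method: the `CloneGroup` record, its field names, and the thresholds are hypothetical, and fault-resiliency is read here as the share of non-buggy commits among all commits touching the clone, which is only one possible interpretation of the abstract's definition.

```python
from dataclasses import dataclass


@dataclass
class CloneGroup:
    """Hypothetical record for one clone group; field names are illustrative."""
    sibling_count: int      # number of clone siblings in the group
    buggy_commits: int      # commits touching the group that are labeled buggy
    non_buggy_commits: int  # commits touching the group that are not
    genealogy_length: int   # length of the clone genealogy


def prevalence(group: CloneGroup) -> int:
    """Clone prevalence: the number of clone siblings."""
    return group.sibling_count


def fault_resiliency(group: CloneGroup) -> float:
    """Share of non-buggy commits among all commits (assumed interpretation)."""
    total = group.buggy_commits + group.non_buggy_commits
    if total == 0:
        raise ValueError("fault-resiliency is undefined without commit history")
    return group.non_buggy_commits / total


def longevity(group: CloneGroup) -> int:
    """Clone longevity: the clone genealogy length."""
    return group.genealogy_length


def label_perspectives(group: CloneGroup, min_siblings: int,
                       min_resiliency: float, min_length: int) -> dict[str, bool]:
    """Label the group per perspective using caller-supplied (hypothetical) thresholds."""
    return {
        "prevalence": prevalence(group) >= min_siblings,
        "fault_resiliency": fault_resiliency(group) >= min_resiliency,
        "longevity": longevity(group) >= min_length,
    }
```

Keeping the three labels separate, instead of combining them into a single verdict, avoids assuming how the perspectives relate to one another; the abstract does not say whether they are combined.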
