September 12, 2025

Code Search Method in Public Repositories Using WMC Metric

Key Points

The study presents a systematic method for assessing code quality metrics in relation to repository popularity.
Key metrics for both functional and object-oriented programming paradigms were identified to evaluate repository quality effectively.
A comparative analysis of data extraction methods was conducted, highlighting the strengths and weaknesses of the GitHub API, GHTorrent, and GitHub Archive.
The findings support improved dataset selection for training AI models in software engineering, enhancing defect detection processes.

Abstract

This study explores the correlation between the popularity of open-source repositories and their quality, as assessed using static code quality metrics. The primary focus is on defining key indicators for two distinct paradigms, functional and object-oriented programming, and on developing a code search method that systematically processes the repositories retrieved during the search. The ultimate purpose of the research is to design an effective code search method for identifying high-quality GitHub repositories, balancing repository popularity against code quality so that the results can be used to train machine learning models. To achieve this, the study reviews current methods of repository extraction, defines relevant code quality metrics for the two paradigms, and analyzes the correlation between quality indicators and repository popularity. A comparative analysis of data extraction methods, in particular the GitHub API, GHTorrent, and GitHub Archive, has been carried out. A detailed comparative table was created for each of these methods, assessing their advantages and limitations and determining the most suitable approach for further work. In addition, both fundamental and niche quality metrics were identified for each programming paradigm to enable a more comprehensive evaluation of repository quality. The study also examines SonarQube, which provides insights into code quality, maintainability, and technical debt, making it a valuable tool for assessing repository suitability for machine learning-based defect prediction. Many widely used open-source projects gain traction due to active community contributions and extensive use, but their intrinsic code quality does not always meet high standards. Conversely, lesser-known repositories may exhibit superior quality but lack sufficient adoption to be considered representative datasets for training machine learning models.
The results of this study contribute to the broader field of software quality assurance and defect prediction by providing a structured approach to evaluating open-source repositories. The proposed method can enhance the selection of reliable datasets for training AI models in software engineering, ultimately leading to more effective defect detection and improved software quality control processes.
