The recent use of large language models (LLMs) in code generation and code summarization tasks has been widely adopted by the software engineering community. New LLMs are emerging regularly with improved functionalities, efficiency, and expanding data that allow models to learn more effectively. The lack of guidelines for selecting the right LLMs for coding tasks makes the selection a subjective choice by developers rather than a choice built on code complexity, code correctness, and linguistic similarity analysis. This research investigates the use of machine learning classification and ranking methods to select the best-suited open-source LLMs for code generation and code summarization tasks. This work conducts a comparison experiment on four open-source LLMs (Mistral, CodeLlama, Gemma 2, and Phi-3) and uses the MBPP coding question dataset to analyze code-generated outputs in terms of code complexity, maintainability, cyclomatic complexity, code structure, and LLM perplexity by collecting these as a set of features. An SVM classification problem is conducted on the highest correlated feature pairs, where the models are evaluated through performance metrics, including accuracy, area under the ROC curve (AUC), precision, recall, and F1 scores. The RankNet ranking methodology is used to evaluate code summarization model capabilities by measuring ROUGE and BERTScore accuracies between LLM code-generated summaries and the coding questions used from the dataset. The study results show a maximum accuracy of 49% for the code generation experiment, with the highest AUC score reaching 86% among the top four correlated feature pairs. The highest precision score reached is 90%, and the recall score reached up to 92%. Code summarization experiment results show Gemma 2 scored a 1.93 RankNet win probability score, and represented the highest ranking reached among other models. The phi3 model was the second-highest ranking with a 1.66 score. The research highlights the potential of machine learning to select LLMs based on coding metrics and paves the way for advancements in terms of accuracy, dataset diversity, and exploring other machine learning algorithms for other researchers.
Building similarity graph...
Analyzing shared references across papers
Loading...
Hussain Mahfoodh
Mustafa Hammad
Bassam A. Y. Alqaralleh
Computers
Mutah University
American University of the Middle East
Building similarity graph...
Analyzing shared references across papers
Loading...
Mahfoodh et al. (Tue,) studied this question.
www.synapsesocial.com/papers/698d6e3c5be6419ac0d53b43 — DOI: https://doi.org/10.3390/computers15020119