What question did this study set out to answer?

This research aims to identify the most effective large language models (LLMs) for code generation and summarization using machine learning techniques.

February 12, 2026 | Open Access

Evaluating LLMs for Source Code Generation and Summarization Using Machine Learning Classification and Ranking

Key Points

Compared four open-source LLMs: Mistral, CodeLlama, Gemma 2, and Phi-3.
Used the MBPP coding question dataset to analyze the generated code.
Ran SVM classification on the most highly correlated feature pairs.
Evaluated model performance using accuracy, AUC, precision, recall, and F1 score.
Applied RankNet to rank the models' summarization effectiveness using ROUGE and BERTScore.
The code generation experiment reached a maximum accuracy of 49%.
The highest AUC was 86%, among the top four correlated feature pairs.
Precision peaked at 90% and recall at 92%.
Gemma 2 achieved the highest RankNet win probability score of 1.93 in summarization tests.
Phi-3 model ranked second with a score of 1.66.
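
To make the RankNet ranking more concrete, here is a minimal sketch of the standard RankNet pairwise model, in which the probability that item i ranks above item j is the logistic function of the score difference. It assumes, purely for illustration, that the reported values (1.93 and 1.66) are raw RankNet output scores; the source does not define how its "win probability score" is computed, and the function names below are hypothetical.

```python
import math


def ranknet_win_probability(score_i: float, score_j: float) -> float:
    """Standard RankNet pairwise model: P(i ranked above j) = sigmoid(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def ranknet_pair_loss(score_i: float, score_j: float, target: float) -> float:
    """Cross-entropy loss for one pair.

    target is 1.0 if i should rank above j, 0.0 if below, 0.5 for a tie.
    """
    p = ranknet_win_probability(score_i, score_j)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# ASSUMPTION: the reported values are treated as raw model scores.
scores = {"Gemma 2": 1.93, "Phi-3": 1.66}

p_gemma_over_phi = ranknet_win_probability(scores["Gemma 2"], scores["Phi-3"])
ranking = sorted(scores, key=scores.get, reverse=True)

print(f"P(Gemma 2 ranks above Phi-3) = {p_gemma_over_phi:.3f}")
print("Ranking:", ranking)
```

Under this reading, the gap between the two scores matters more than their absolute values: equal scores give a probability of 0.5, and larger gaps push it toward 1.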

Abstract

Large language models (LLMs) have been widely adopted by the software engineering community for code generation and code summarization tasks. New LLMs emerge regularly with improved functionality, greater efficiency, and expanding data that allow them to learn more effectively. Because there are no guidelines for selecting the right LLM for coding tasks, developers choose subjectively rather than on the basis of code complexity, code correctness, and linguistic similarity analysis. This research investigates the use of machine learning classification and ranking methods to select the best-suited open-source LLMs for code generation and code summarization. It compares four open-source LLMs (Mistral, CodeLlama, Gemma 2, and Phi-3) on the MBPP coding question dataset, collecting code complexity, maintainability, cyclomatic complexity, code structure, and LLM perplexity of the generated code as a set of features. SVM classification is performed on the most highly correlated feature pairs, and the models are evaluated with accuracy, area under the ROC curve (AUC), precision, recall, and F1 score. For code summarization, the RankNet ranking method evaluates the models using ROUGE and BERTScore, measured between the LLM-generated code summaries and the corresponding coding questions from the dataset. The code generation experiment reached a maximum accuracy of 49%, with the highest AUC reaching 86% among the top four correlated feature pairs; precision peaked at 90% and recall at 92%. In the code summarization experiment, Gemma 2 ranked highest among the models with a RankNet win probability score of 1.93, followed by Phi-3 with a score of 1.66. The research highlights the potential of machine learning for selecting LLMs based on coding metrics and paves the way for other researchers to improve accuracy, broaden dataset diversity, and explore other machine learning algorithms.
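
The summarization comparison rests on overlap between a generated summary and the original coding question. As a rough illustration of that idea, the sketch below computes ROUGE-1 (unigram overlap) precision, recall, and F1 using only the Python standard library. It is a simplified stand-in, not the study's implementation: real ROUGE packages may add stemming and other variants, BERTScore requires a pretrained model and is not shown, and the example question and summary are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the reference is the coding question,
# the candidate is an LLM-written summary of the generated code.
question = "Write a function to add two numbers"
summary = "A function to add two numbers"

result = rouge1(summary, question)
print(result)
```

Per-model scores of this kind, averaged over the dataset, are the sort of signal a ranking method could consume when ordering models by summarization quality.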

Authors

Hussain Mahfoodh

Mustafa Hammad

Bassam A. Y. Alqaralleh

Journals

Computers

Institutions

Mutah University

American University of the Middle East
