September 10, 2025

Loan Repayment Default Prediction Using Supervised Machine Learning Techniques on Financial Data

Key Points

KNN with cross-validation reaches 97.47% accuracy in predicting loan repayment on the first dataset.
On the second dataset, KNN reaches 88.21% accuracy, with several nearest-neighbour configurations tested.
Logistic regression is competitive, correctly classifying up to 96.93% of instances on the first dataset with a 70% training set.
Several training percentages (40%, 50%, 60%, 70%) and cross-validation values (15, 10, 5) are compared to test how robust the predictions are.

Abstract

As technology helps businesses and ideas expand, more and more people are applying for loans for personal or business use. However, banks have limited assets, which limits the number of loans they can grant. Identifying the right applicants can be a time-consuming process. Banks seek to lend to individuals who can repay on time, so that the bank obtains maximum profit. This work aims to solve the loan default problem at minimum cost to banks. It consists of five main stages: pre-processing, feature extraction, machine learning techniques, evaluation models, and performance analysis to select the best machine learning models. Two datasets with different features are used: the first has five features, and the second has eighteen. Each dataset is split using several training percentages (40%, 50%, 60%, and 70%), with the remainder used for testing; all experiments are run in the Weka application. KNN is applied with different cross-validation values (15, 10, and 5) and different numbers of nearest neighbours (1, 5, 10, and 15). For the first dataset, the highest accuracy is 97.47%, obtained with cross-validation values of 15 and 10 and 10 nearest neighbours. On the second dataset, KNN reaches its highest accuracy, 88.21%, with all three cross-validation values (15, 10, and 5) and 15 nearest neighbours. Logistic regression is then applied for comparison: its highest rate of correct classification is 96.93%, obtained with a 70% training set on the first dataset. For the second dataset, the highest accuracy is 88.32%, obtained when 40% of the data is used for training and the rest for testing.
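
The evaluation protocol described in the abstract (KNN scored under several cross-validation values, neighbour counts, and percentage splits) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the study's Weka workflow: the data is a synthetic two-feature stand-in for a loan dataset, and every name here (`make_toy_data`, `knn_predict`, `cross_val_accuracy`, `split_accuracy`) is hypothetical. The accuracies it prints have no relation to the figures reported above.

```python
import random
from collections import Counter


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for a loan dataset (assumption): two numeric
    features and a binary label, 1 = repaid, 0 = default."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 if label == 1 else 0.0  # classes differ in their mean
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x
    (squared Euclidean distance; ties go to the nearer class)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test rows whose label KNN predicts correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def cross_val_accuracy(rows, k, folds, seed=0):
    """Mean accuracy over `folds` disjoint test folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    scores = []
    for i in range(folds):
        test = rows[i::folds]
        train = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(accuracy(train, test, k))
    return sum(scores) / folds


def split_accuracy(rows, k, train_fraction, seed=0):
    """Train on the first `train_fraction` of the shuffled rows, test on the rest."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return accuracy(rows[:cut], rows[cut:], k)


if __name__ == "__main__":
    data = make_toy_data()
    for folds in (15, 10, 5):
        for k in (1, 5, 10, 15):
            acc = cross_val_accuracy(data, k, folds)
            print(f"folds={folds:2d} k={k:2d} accuracy={acc:.4f}")
    for fraction in (0.4, 0.5, 0.6, 0.7):
        acc = split_accuracy(data, 10, fraction)
        print(f"train={fraction:.0%} k=10 accuracy={acc:.4f}")
```

Running the script prints one accuracy per configuration, which is the kind of grid from which a "best" setting such as 10 nearest neighbours would be read off. Weka's own KNN and evaluation options may differ in details such as tie-breaking and how fold results are aggregated.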
