What question did this study set out to answer?

This study aims to develop an integrated machine learning framework for accurately predicting insurance claim amounts and classifying fraud risk.

April 25, 2026 · Open Access

An Integrated Random Forest-Based Framework for Insurance Claim Amount Regression and Fraud Risk Classification in Imbalanced Datasets

Key Points

Created a synthetic dataset of 15,000 insurance claims across 19 attributes.
Employed Random Forest models for regression (claim amounts) and classification (fraud detection) with feature normalization.
Utilized an 80/20 stratified holdout split for model training and evaluation.
Regression model achieved a Mean Absolute Error below INR 15,000 and R-squared above 0.70.
Classification model delivered accuracy exceeding 0.80 with fraud recall above 0.74 and F1-Score above 0.76, outperforming logistic regression.
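Prediction outputs were enriched with four derived business metrics (predicted claim amount, claim variance, fraud risk probability, and a three-tier fraud risk category) and stored in a MySQL database for analysis.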
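
The evaluation metrics named in the key points can be made concrete with a small sketch. The code below is illustrative only, not the authors' evaluation code: it computes Mean Absolute Error, R-squared, and fraud-class recall and F1 in plain Python on made-up toy values. All function names and sample numbers are assumptions, and treating F1 as a fraud-class score (rather than a weighted average) is also an assumption, since the source does not say which was reported.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted claim amounts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def fraud_recall_and_f1(actual, predicted, positive=1):
    """Recall and F1 for the fraud (positive) class only."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Toy values (hypothetical, in INR); not taken from the study's dataset.
claims_actual = [100000.0, 50000.0, 75000.0, 120000.0]
claims_predicted = [90000.0, 55000.0, 80000.0, 110000.0]
mae = mean_absolute_error(claims_actual, claims_predicted)
r2 = r_squared(claims_actual, claims_predicted)

# Toy fraud labels: 1 = fraud, 0 = legitimate.
fraud_actual = [1, 0, 1, 1, 0, 0, 1, 0]
fraud_predicted = [1, 0, 0, 1, 0, 1, 1, 0]
recall, f1 = fraud_recall_and_f1(fraud_actual, fraud_predicted)

print(f"MAE={mae:.0f} R2={r2:.3f} recall={recall:.2f} F1={f1:.2f}")
```

In practice, a library such as scikit-learn offers equivalent functions (`mean_absolute_error`, `r2_score`, `recall_score`, `f1_score` in `sklearn.metrics`); the plain-Python version is used here only to keep the arithmetic visible.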

Abstract

The insurance industry confronts two analytically critical and financially consequential challenges: accurate prediction of claim settlement amounts and timely detection of fraudulent claims. Conventional approaches (rule-based heuristics, logistic regression scorecards, and manual adjuster assessments) are demonstrably inadequate for capturing the nonlinear, high-dimensional interactions that characterise modern insurance claim data. This paper presents ClaimSmart AI, a comprehensive, modular, end-to-end machine learning pipeline that addresses both challenges within a unified analytical framework. The system operates on a synthetically generated dataset of 15,000 insurance claim records encompassing 19 attributes spanning policyholder demographics, policy characteristics, vehicle parameters, claim specifics, and behavioural indicators. A dual-model architecture employs a Random Forest Regressor (150 estimators) for claim amount prediction and a Random Forest Classifier (150 estimators, balanced class weights) for binary fraud risk detection, both trained on a stratified 80/20 holdout split with StandardScaler feature normalisation and LabelEncoder categorical transformation. The regression model achieves a Mean Absolute Error below INR 15,000 and an R-squared coefficient of determination exceeding 0.70, while the classification model delivers accuracy above 0.80, fraud-class recall exceeding 0.74, and F1-Score above 0.76, surpassing logistic regression and rule-based baselines on equivalent evaluation protocols. Prediction outputs are enriched with four derived business metrics (predicted claim amount, claim variance, fraud risk probability, and a three-tier fraud risk category) and persisted to a MySQL relational database for direct consumption by Power BI and enterprise analytics platforms. Eight publication-quality visualisation charts provide comprehensive analytical coverage from fraud distribution and regional heatmaps to actual-versus-predicted scatter analysis. A mysqldump-format SQL export module ensures enterprise portability and regulatory archival compliance. The complete pipeline executes through a single orchestration script, establishing ClaimSmart AI as both a rigorous academic contribution and a practical template for production insurance analytics deployment.
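
The abstract names four derived business metrics but does not define them, so the following is only a sketch under stated assumptions of how such an enrichment step might look. Claim variance is assumed to be the actual claim amount minus the predicted amount, and the three-tier thresholds (0.30 and 0.70) and tier labels are hypothetical placeholders; none of these names or values come from the paper.

```python
from dataclasses import dataclass

# Assumed tier boundaries; the paper does not state its thresholds.
LOW_MAX = 0.30
HIGH_MIN = 0.70


@dataclass
class EnrichedPrediction:
    predicted_claim_amount: float
    claim_variance: float          # assumed: actual minus predicted
    fraud_risk_probability: float
    fraud_risk_category: str       # assumed labels: Low / Medium / High


def fraud_risk_category(probability: float) -> str:
    """Map a fraud probability in [0, 1] onto one of three tiers."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < LOW_MAX:
        return "Low"
    if probability < HIGH_MIN:
        return "Medium"
    return "High"


def enrich(actual_amount: float, predicted_amount: float,
           fraud_probability: float) -> EnrichedPrediction:
    """Bundle the four derived metrics for one claim record."""
    return EnrichedPrediction(
        predicted_claim_amount=round(predicted_amount, 2),
        claim_variance=round(actual_amount - predicted_amount, 2),
        fraud_risk_probability=round(fraud_probability, 4),
        fraud_risk_category=fraud_risk_category(fraud_probability),
    )


# Hypothetical claim: actual INR 120,000, predicted INR 110,000.
row = enrich(120000.0, 110000.0, 0.82)
print(row)
```

A record shaped like `row` maps directly onto one row of a relational table, which is presumably what makes the outputs convenient to persist and to query from a reporting tool.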
