What question did this study set out to answer?

The aim is to develop a machine learning framework that accurately predicts sulphur content in crude oil from physicochemical properties.

January 18, 2026

Machine learning framework for sulphur content prediction in crude oil using physicochemical properties

Key Points

Analyzed 664 crude oil samples characterized by 73 physicochemical properties
Trained five regression algorithms: support vector regression, k-nearest neighbours, decision tree, random forest, and XGBoost
Validated model performance using cross-validation techniques
Performed SHAP analysis to identify key influential features
XGBoost achieved the highest test-set accuracy (R² = 0.89)
XGBoost outperformed random forest (R² = 0.74) and decision tree (R² = 0.72)
Cross-validation confirmed the robustness of XGBoost (mean R² = 0.92)
Watson K, asphaltene content, and nitrogen by weight (%) were identified as top influential features
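
To make the reported metrics concrete, here is a minimal sketch of how a held-out R² and a mean cross-validated R² can be computed. It is an illustration under stated assumptions, not the study's pipeline: the data are synthetic, a one-variable least-squares fit stands in for the five regressors, the helper names `r2_score`, `fit_line`, and `k_fold_r2` are hypothetical, and k = 5 folds is assumed because the summary does not state the fold count.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for a real regressor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return lambda x: a + b * x


def k_fold_r2(xs, ys, k=5, seed=0):
    """Shuffle indices, split into k folds, return R² on each held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in held_out]
        scores.append(r2_score([ys[i] for i in held_out], preds))
    return scores


# Synthetic data only: a noisy linear trend, NOT the study's dataset.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [0.3 * x + rng.gauss(0.0, 0.3) for x in xs]

fold_scores = k_fold_r2(xs, ys, k=5)
print("per-fold R2:", [round(s, 3) for s in fold_scores])
print("mean R2:", round(mean(fold_scores), 3))
```

Each fold is scored only on samples the model did not see during fitting, which is why a mean cross-validated R² is usually read as a robustness check alongside a single test-set figure.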

Abstract

Accurate prediction of sulphur content in crude oil is essential for optimizing refining efficiency, ensuring environmental compliance, and improving fuel quality. This study introduces an explainable machine-learning framework to predict sulphur weight percentage (wt.%) using a comprehensive dataset of 664 crude oil samples characterized by 73 physicochemical properties. Five regression algorithms, namely support vector regression, k-nearest neighbours, decision tree, random forest, and extreme gradient boosting (XGBoost), were trained and validated under identical preprocessing and cross-validation protocols. XGBoost achieved the highest test-set accuracy (R² = 0.89), significantly outperforming random forest (R² = 0.74), decision tree regression (R² = 0.72), and the other models. Cross-validation confirmed the robustness of XGBoost (mean R² = 0.92), while Shapley additive explanations (SHAP) analysis identified Watson K, asphaltene content, and nitrogen by weight (%) as the most influential features. The novelty of this study lies in integrating a high-dimensional dataset with explainable AI (SHAP) to uncover physicochemical drivers of sulphur content, thereby achieving both improved accuracy and interpretability over existing models. This data-driven approach provides a scalable and precise sulphur estimation tool that enables refiners to optimize blending strategies, reduce desulphurization costs, and comply with stringent environmental regulations.
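
The idea behind the SHAP analysis can be sketched with exact Shapley values on a toy model. Everything below is an assumption made for illustration: the function `shapley_values`, the model `toy_model`, its coefficients, and the sample and baseline numbers are hypothetical and are not taken from the study. Features outside a coalition are replaced by a fixed baseline value, which is a simplification; the summary does not say which SHAP estimator the authors used.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition take their baseline value (a simplifying
    assumption; SHAP libraries use more refined estimators).
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Hypothetical toy model and numbers, chosen only for illustration.
def toy_model(feats):
    return (0.5 * feats["asphaltene"]
            + 2.0 * feats["nitrogen"]
            - 0.4 * (feats["watson_k"] - 12.0)
            + 0.1 * feats["asphaltene"] * feats["nitrogen"])


sample = {"watson_k": 11.5, "asphaltene": 4.0, "nitrogen": 0.3}
baseline = {"watson_k": 12.0, "asphaltene": 2.0, "nitrogen": 0.1}

phi = shapley_values(toy_model, sample, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {contribution:+.3f}")
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that lets per-feature attributions be ranked and compared. Exact enumeration grows exponentially with the number of features, so with 73 properties a practical analysis would rely on an approximate or model-specific estimator instead.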


Cite This Study

Pullanikkattil et al. studied this question.

synapsesocial.com/papers/696c789ceb60fb80d1396ce9 https://doi.org/10.1002/cjce.70240
