What question did this study set out to answer?

This research aims to develop a robust framework using hybrid machine learning and deep learning models for detecting ad click fraud.

May 3, 2026 | Open Access

An Explainable Hybrid Deep Learning and Gradient Boosting Framework for Ad Click Fraud Detection

Key Points

This research aims to develop a robust framework using hybrid machine learning and deep learning models for detecting ad click fraud.
Utilized the Ad Click Fraud Detection Dataset from Kaggle as training data.
Implemented various machine learning (ML) and deep learning (DL) models, including LSTM, GRU, and Gradient Boosting with ensemble techniques.
Applied random undersampling (RUS) and SMOTE during preprocessing to address class imbalance and improve model accuracy.
Achieved 100% accuracy, precision, recall, and F1-score with the Voting Classifier.
Demonstrated the effectiveness of advanced ML and DL models in detecting sophisticated click fraud patterns.
Provided interpretable outputs using XAI techniques such as LIME and SHAP.

Abstract

Click fraud is one of the biggest problems in digital advertising. It is costly and makes online marketing statistics less dependable. Traditional fraud detection methods often fail to detect new fraud patterns and sophisticated user behavior. This research uses the Ad Click Fraud Detection Dataset from Kaggle, which records how users interacted with both genuine and fraudulent ad clicks. In the preprocessing stage, unnecessary features were discarded and class imbalance was corrected with RUS and SMOTE to ensure data quality and unbiased training. The ML and DL models developed and evaluated include CNN, DNN, RNN, LR, DT, RF, KNN, ANN, Gradient Boosting, LightGBM, XGBoost, NB, and SVM. To make the predictions more accurate, we also employed LSTM, GRU, LSTM + GRU, and a Voting Classifier (XGBoost with Bagging DT). XAI techniques (LIME and SHAP) were used to make the results easier to interpret. The trained models were also deployed in a Flask-based web interface to predict ad click fraud in real time. In testing, the Voting Classifier achieved 100% accuracy, precision, recall, and F1-score, indicating that it can serve as an effective method of detecting fake ad clicks.

Keywords: Click fraud, machine learning, deep learning, online advertising, bot detection, pay-per-click, fraud

1. Introduction

Online advertising is a significant part of business marketing in the contemporary digital economy, as it enables enterprises to reach the right audience on both computers and phones. Current online marketing systems are built on the pay-per-click (PPC) model, in which advertisers pay each time a person clicks on one of their ads [1]. The more companies invest in digital advertising, the more they rely on user click statistics to determine the success of their campaigns and their return on investment.
However, this reliance has also created serious security gaps, particularly click fraud, in which PPC systems are deceived in order to generate money [2]. Click fraud, one of the major problems affecting the digital advertising business, occurs when an advertisement is clicked without authorization or with malicious intent [3]. Such activity is usually carried out by software bots, click farms, or competitors, and it results in distorted engagement metrics and wasted money [4]. Common fraud detection methods are heuristic or statistical, and they cannot always keep up with emerging forms of fraud [5]. VPNs, proxy servers, and distributed botnets allow fraudsters to conceal their identities and pose as real users, which makes them harder to catch [6]. Consequently, a significant research gap remains in designing models that can adapt and learn to recognize fraudulent clicks in a changing real-world context [7].

This paper proposes an intelligent analytical approach to detecting ad click fraud using data on user interaction with advertising. The aim is to distinguish legitimate from fraudulent activity correctly while keeping the system understandable and usable in real time [8]. By using behavioural features and interaction patterns, the proposed model is expected to yield a scalable, transparent, and data-driven detection system. XAI is added to make the model's predictions more understandable and more convincing to marketers [9]. By reducing fraud, safeguarding advertisers' funds, and promoting healthy competition, this research contributes to the improvement of online advertising. Ultimately, it helps establish trust, reliability, and sustainability in the online marketing space [10].

2. Related Work

Over the last several years, much research has examined ML and AI methods for identifying fake clicks on online advertisements. Shaik and Kakulapati [11] developed a feature-based ML strategy for detecting fake ad clicks, demonstrating that trained models can classify user click behavior accurately. Their study, however, was limited by a small dataset and did not keep up with changing fraud trends. Aljabri and Mohammad [12] also used traditional classifiers to detect click fraud online, arguing that data-driven approaches are superior to rule-based ones. Although they achieved good results, their work emphasized accuracy over interpretability and immediate scalability, both of which matter for real-world use in advertising networks. Sisodia and Sisodia [13] developed a stacked generalization architecture for predicting publisher behavior on highly skewed datasets. The key idea of this framework was to combine more than one base learner to achieve better classification. Nevertheless, the algorithm consumed significant computational resources, and it was not clear which features were most significant. Sisodia, Sisodia, and Singh [14] extended this work, focusing on the role of important features in identifying bad publishers and highlighting the usefulness of feature selection in fraud detection. However, their research examined only fixed datasets and did not address click fraud mechanisms that constantly evolve. Alzahrani and Aljabri [15] conducted a comprehensive review of AI-based approaches to detecting ad click fraud. They highlighted XAI and ensemble learning as areas that require further research. Their review found a recurring tension between building highly accurate models and building models that are easy to understand and apply in real-life settings.
DL techniques have made fraud detection systems more powerful. Batool and Byun [16] proposed an ensemble DL system for detecting click fraud in PPC campaigns. This architecture was more accurate and less susceptible to being misled by malicious user actions, but its complexity made it harder to run in real time. Sisodia and Sisodia [17] developed a K-nearest neighbor model that uses quad division prototype selection to handle imbalanced datasets. The model reduces the computational burden but is sensitive to noise and to fraud patterns that are not apparent. Similarly, Chari et al. [18] evaluated several ML algorithms using behavioural and feature engineering methods. They achieved good results, but scalability and overfitting problems arose on large datasets.

Beyond ad-specific research, broader fraud detection studies have also provided valuable information. Dekou et al. [19] examined ML approaches to identifying fraud in online marketplaces, using domain adaptation and generalizability to detect various forms of fraud across platforms. Sisodia and Sisodia [20] applied gradient boosting to identify fraudulent publishers; it was more accurate than conventional classifiers but had difficulty with temporal data and model interpretability.

3. Materials And Methods

The proposed system is intended to be a scalable, intelligent system for detecting fake ad clicks that is both interpretable and accurate. The approach pre-processes the Ad Click Fraud Detection Dataset from Kaggle, which contains labeled user interaction records, and applies feature optimization, feature encoding, and normalization to ensure the quality of the inputs used to train the models. Class imbalance is corrected with RUS and SMOTE to produce results that generalize better.
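As a rough illustration of the two resampling ideas named above, the following sketch implements simplified, standard-library-only versions of random undersampling (RUS) and SMOTE-style interpolation on toy data. The function names, parameters, and toy values are assumptions made for this example; the study does not describe its implementation, and a library implementation was presumably used in practice.

```python
import random

def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep a random subset of n_keep majority rows; keep all other rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    kept = set(rng.sample(majority_idx, n_keep))
    rows = [i for i, label in enumerate(y) if label != majority_label or i in kept]
    return [X[i] for i in rows], [y[i] for i in rows]

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        # Rank the other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in X_min if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic

# Toy data (hypothetical): 8 legitimate clicks (label 0), 3 fraudulent (label 1).
X = [[float(i), float(i % 3)] for i in range(8)]
X += [[10.0, 1.0], [11.0, 2.0], [12.0, 1.5]]
y = [0] * 8 + [1] * 3

# Step 1: shrink the majority class. Step 2: grow the minority class.
X_rus, y_rus = random_undersample(X, y, majority_label=0, n_keep=5)
X_min = [x for x, label in zip(X_rus, y_rus) if label == 1]
X_new = smote_like(X_min, n_new=2)
X_bal = X_rus + X_new
y_bal = y_rus + [1] * len(X_new)
```

Undersampling discards majority rows, while the SMOTE-style step adds synthetic minority rows that lie on line segments between existing minority points, so combining the two balances the classes without relying on exact duplicates.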
Several ML and DL models are applied, and to make the predictions more accurate, ensemble learning is applied through a Voting Classifier that includes XGBoost, Bagging, and DT. LSTM, GRU, and hybrid LSTM + GRU networks are applied to capture changes in user behavior over time. XAI techniques such as LIME and SHAP are included to make the model more understandable, and a Flask-based deployment interface allows fraud to be detected in real time. This comprehensive approach makes ad click fraud detection more scalable, reliable, and powerful.

Fig. 1. System Architecture

Figure 1 shows the system design, which lays out the full workflow of ad click fraud detection. The first step is to obtain the Ad Click Fraud data. The second is data pre-processing, which involves cleaning, removal of nulls, removal of duplicates, and label encoding. Correlation visualization and data analysis help reveal trends, and techniques such as SMOTE balance the class distribution. The data is then divided into training and test sets, and models are built using ML and DL. The trained models are tested, evaluated with metrics, saved, and deployed with Flask. The deployed system uses LIME and SHAP to explain real-time fraud predictions.

Dataset Collection

The dataset used in this research is the Ad Click Fraud Detection Dataset from Kaggle. It consists of 5,000 user interaction records collected from various forms of online ads. The dataset contains 21 features, including categorical, numerical, and time-related ones, such as device type, browser, click duration, and behavioral indicators. The "is fraudulent" column indicates whether each record is fraudulent. The data includes diverse, realistic user actions and a skewed class distribution that reflects the occurrence of fraud in the real world.
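The cleaning, label encoding, and train/test split steps of the workflow can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the column names (device_type, click_duration, is_fraudulent), the toy records, and the choice of a stratified split with a 20% test share are all hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict

def clean(records):
    """Drop rows with missing values, then drop exact duplicate rows."""
    seen, out = set(), []
    for row in records:
        if any(v is None for v in row.values()):
            continue
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def label_encode(records, column):
    """Replace each category in `column` with an integer code (sorted order)."""
    codes = {cat: i for i, cat in enumerate(sorted({r[column] for r in records}))}
    return [{**r, column: codes[r[column]]} for r in records], codes

def stratified_split(records, target, test_ratio=0.2, seed=0):
    """Split so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r[target]].append(r)
    train, test = [], []
    for rows in by_class.values():
        rng.shuffle(rows)
        n_test = max(1, round(len(rows) * test_ratio))
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test

# Hypothetical toy records; the real dataset has 21 features.
raw = [
    {"device_type": "mobile", "click_duration": 0.4, "is_fraudulent": 1},
    {"device_type": "mobile", "click_duration": 0.4, "is_fraudulent": 1},    # duplicate
    {"device_type": "desktop", "click_duration": None, "is_fraudulent": 0},  # missing value
    {"device_type": "mobile", "click_duration": 0.1, "is_fraudulent": 1},
]
raw += [
    {"device_type": "desktop" if i % 2 else "tablet",
     "click_duration": 1.0 + i, "is_fraudulent": 0}
    for i in range(8)
]

rows = clean(raw)
rows, codes = label_encode(rows, "device_type")
train, test = stratified_split(rows, "is_fraudulent")
```

Splitting per class is one way to keep the rare fraud label represented in both partitions when the class distribution is skewed; a purely random split of a small, imbalanced sample can leave the test set with no fraud cases at all.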
The dataset offers a diverse and versatile representation of user-machine interactions and time series, and can therefore be used to test sophisticated ML and DL models for detecting fraudulent ad clicks.

Fig. 2. Ad Click Fraud Dataset

Pre-Processing

In the
