Deliberate misreporting of tax revenues is a long-standing problem for fiscal authorities across the globe. Traditional countermeasures, such as threshold-based rule engines and periodic human reviews, cannot keep pace with increasingly diverse evasion techniques and growing data volumes. This paper proposes and benchmarks a machine-learning detection system based on three gradient-boosting algorithms: AdaBoost, Gradient Boosting, and XGBoost. Experiments used a structured synthetic dataset of 1,000 taxpayer profiles with twelve financial and behavioral attributes; all boosting models were trained and compared against five traditional baselines under identical conditions. The empirical findings from a complete notebook run indicate that XGBoost achieves a very high R² of 0.9850, ranks second overall, and is significantly ahead of all non-boosting models except Random Forest. Gradient Boosting and AdaBoost obtained R² values of 0.9850 and 0.8560, respectively. These results support the argument that iteratively constructed ensemble models are significantly better suited than linear or proximity-based methods to ordinally encoded tax-risk targets.
Shariq et al. (Thu,) studied this question.
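To make the evaluation idea concrete, the following is a minimal sketch of how an iteratively constructed ensemble can be fitted to an ordinally encoded risk target and scored with R². It is not the pipeline, dataset, or notebook described above: it uses plain squared-loss gradient boosting over decision stumps written in standard-library Python, and the two features (`income_gap`, `cash_ratio`), the labelling rule, the sample sizes, and all function names are hypothetical choices made only for illustration.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_stump(X, residuals):
    """Return the one-feature threshold split with the lowest squared error."""
    n = len(X)
    total = sum(residuals)
    best = None  # (gain, feature, threshold, left_value, right_value)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # no threshold fits between equal values
            left_mean = left_sum / k
            right_mean = (total - left_sum) / (n - k)
            # Minimising the squared error is the same as maximising this gain.
            gain = k * left_mean ** 2 + (n - k) * right_mean ** 2
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value


def fit_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Squared-loss gradient boosting: each stump is fitted to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, X):
    base, stumps, learning_rate = model
    return [base + learning_rate * sum(stump_predict(s, row) for s in stumps)
            for row in X]


def make_profiles(n, rng):
    """Hypothetical synthetic profiles with an ordinal risk level in {0, 1, 2}."""
    X, y = [], []
    for _ in range(n):
        income_gap = rng.random()   # assumed feature, not from the paper
        cash_ratio = rng.random()   # assumed feature, not from the paper
        X.append([income_gap, cash_ratio])
        y.append(float((income_gap > 0.5) + (cash_ratio > 0.7)))
    return X, y


rng = random.Random(0)
X_train, y_train = make_profiles(300, rng)
X_test, y_test = make_profiles(200, rng)

# Constant baseline: always predict the training mean.
train_mean = sum(y_train) / len(y_train)
r2_mean = r2_score(y_test, [train_mean] * len(y_test))

# One boosting round versus one hundred rounds, scored on held-out profiles.
r2_one = r2_score(y_test, predict(fit_boosting(X_train, y_train, n_rounds=1), X_test))
r2_many = r2_score(y_test, predict(fit_boosting(X_train, y_train, n_rounds=100), X_test))

print(f"mean baseline R2: {r2_mean:.4f}")
print(f"1 round R2:       {r2_one:.4f}")
print(f"100 rounds R2:    {r2_many:.4f}")
```

Because the hypothetical target here is a sum of two step functions, an additive ensemble of stumps can represent it almost exactly, so the held-out R² should climb as rounds are added. Real tax data would not be this clean, and the figures printed by this sketch say nothing about the results reported above.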