ABSTRACT Inland water-quality monitoring systems generate increasingly large volumes of environmental data, creating opportunities for advanced analytical methods to identify subtle pollution signals that traditional threshold-based monitoring may miss. This study proposes an interpretable unsupervised machine-learning framework for detecting anomalies in inland water-quality monitoring data using multiple complementary detection models and explainable AI techniques. The data comprise more than 17,000 observations across 23 states, curated by the Central Pollution Control Board (2021); a processed subset was used for model evaluation. The framework combines four detection models (Isolation Forest, One-Class Support Vector Machine, Elliptic Envelope, and Autoencoder) with ecological feature engineering (dissolved oxygen deficit, BOD/DO ratio, coliform load index) to increase sensitivity to complex pollution stressors. SHAP-based explanations, t-SNE projections, and statistical comparisons were used to support interpretability and robust validation of the detected anomalies. The results show that roughly 7.88% to 8% of observations were flagged as anomalous, with peri-urban tanks in Karnataka and Uttar Pradesh identified as hotspots characterised by oxygen depletion and microbial contamination, especially during post-monsoon seasons. The ensemble identified all domain-specific threshold violations, with Isolation Forest and Autoencoder performing best (F1 = 0.70).
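The three engineered features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the formulas, so the definitions below are assumptions. The dissolved oxygen deficit is taken as saturation minus observed DO, with a fixed saturation of about 8.26 mg/L (fresh water near 25 °C at sea level). The coliform load index is a hypothetical log10 transform of the coliform count. The record keys `do`, `bod`, and `total_coliform` are illustrative names.

```python
import math

# Assumption: fixed DO saturation (mg/L) for fresh water near 25 degrees C.
# A real pipeline would derive this from the measured water temperature.
DO_SATURATION_MG_L = 8.26


def do_deficit(do_mg_l, do_sat_mg_l=DO_SATURATION_MG_L):
    """Dissolved oxygen deficit: saturation minus observed DO, floored at 0."""
    return max(do_sat_mg_l - do_mg_l, 0.0)


def bod_do_ratio(bod_mg_l, do_mg_l, eps=1e-6):
    """BOD/DO ratio; eps guards against division by zero in anoxic samples."""
    return bod_mg_l / max(do_mg_l, eps)


def coliform_load_index(total_coliform):
    """Hypothetical index: log10(1 + count) compresses the heavy right tail."""
    return math.log10(1.0 + total_coliform)


def engineer_features(record):
    """Return the three derived features for one observation (a dict)."""
    return {
        "do_deficit": do_deficit(record["do"]),
        "bod_do_ratio": bod_do_ratio(record["bod"], record["do"]),
        "coliform_load_index": coliform_load_index(record["total_coliform"]),
    }


# Illustrative values only, not taken from the study's dataset.
sample = {"do": 4.0, "bod": 8.0, "total_coliform": 999.0}
features = engineer_features(sample)
```

Derived features of this kind would be appended to the raw measurements before the detectors are fitted, so that a sample with low oxygen and high organic load stands out even when neither raw value crosses a threshold on its own.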
Bhowmik et al. (Mon,) studied this question.