The rapid growth of publicly available online datasets has created new opportunities for machine learning research; however, these datasets are often sampled from populations that are unknown to the researcher. As a result, the processes used for data collection, labelling, and sampling are frequently undocumented, increasing the risk of unrepresentative or biased samples. When training data fail to reflect the underlying population, model outputs may propagate or amplify existing biases. This paper investigates the presence of bias in online datasets drawn from different domains. The research aims to identify biases associated with the distribution of attributes that represent protected personal characteristics and to evaluate methods that mitigate these biases by improving data representativeness. Three datasets were selected for the study, each containing at least one protected variable and supporting binary classification. The protected variables examined in the three datasets were marital status, race, and gender, respectively. Model performance was assessed using accuracy, sensitivity, and specificity. To investigate and address bias, data quality and representativeness techniques were applied, together with bivariate statistical analysis to remove variables with no significant association with the class label. Initial results showed that accuracy alone gave a misleading picture of model performance, particularly when sensitivity was low and specificity was high. After the data representativeness and mitigation techniques were applied, the models achieved more balanced performance across all three metrics, despite slight reductions in accuracy and specificity. The findings highlight that accuracy alone is insufficient as a performance metric. When datasets are not representative, bias mitigation methods that balance sensitivity and specificity may reduce accuracy but lead to more equitable outcomes.
Classification models perform more reliably when class distributions are balanced, and fair systems should ensure equitable accuracy, sensitivity, and specificity across all protected subgroups.
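To make the point about misleading accuracy concrete, the following is a minimal sketch of how the three metrics can be computed overall and per protected subgroup. It is not the pipeline used in the study: the function names (`confusion_counts`, `metrics`, `metrics_by_group`), the group labels "A" and "B", and the toy labels and predictions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (TP, FN, TN, FP) for binary labels, where 1 is the positive class."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(1, 1)], counts[(1, 0)], counts[(0, 0)], counts[(0, 1)]


def metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (TP rate), and specificity (TN rate)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    total = tp + fn + tn + fp
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


def metrics_by_group(y_true, y_pred, groups):
    """Compute the same three metrics separately for each subgroup value."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = metrics([y_true[i] for i in idx],
                                [y_pred[i] for i in idx])
    return result


# Hypothetical toy data: 10 positives and 90 negatives, with a model that
# predicts the negative class for almost every record.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 2 + [0] * 8 + [0] * 88 + [1] * 2
# Hypothetical protected attribute with two subgroups.
groups = ["A"] * 5 + ["B"] * 5 + ["A"] * 45 + ["B"] * 45

overall = metrics(y_true, y_pred)
by_group = metrics_by_group(y_true, y_pred, groups)

print(overall)
for name, values in by_group.items():
    print(name, values)
```

On this toy data the overall accuracy is 0.90 and specificity is about 0.98, while sensitivity is only 0.20, and the subgroup breakdown shows sensitivity of 0.40 for group "A" against 0.00 for group "B". This is the pattern described above: a high accuracy figure can coexist with poor detection of the positive class and with uneven performance across subgroups.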
Najah et al. (Sat,) studied this question.