What does this research mean for the field?

Applying data representativeness and bias mitigation techniques to unrepresentative datasets leads to more equitable model outcomes by balancing sensitivity and specificity, despite slight reductions in overall accuracy. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to identify biases in online datasets related to protected characteristics and evaluate methods to improve their representativeness.

June 14, 2026 | Open Access

An Investigation into Bias and Data Representativeness in Online Datasets

Key Points

Three datasets containing protected variables like marital status, race, and gender were selected for analysis.
Model performance was assessed using accuracy, sensitivity, and specificity metrics.
Bivariate statistical analysis and data quality techniques were applied to address identified biases.
Initial findings revealed that accuracy alone gave a misleading picture of model performance, particularly when sensitivity was low and specificity was high.
After applying bias mitigation techniques, models showed more balanced performance across metrics, despite some reductions in accuracy.
Equitable outcomes were achieved through balancing sensitivity and specificity, indicating that balanced class distributions enhance model reliability.
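
The point about accuracy being misleading can be illustrated with a minimal sketch. The labels and predictions below are hypothetical and are not taken from the study's datasets; the function names are illustrative only. With 10 positive and 90 negative cases, a classifier that misses most positives still scores a high accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical imbalanced sample: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 2 of 10 positives found, 1 of 90 negatives mislabelled.
y_pred = [1] * 2 + [0] * 8 + [1] * 1 + [0] * 89

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample the accuracy is 0.91 while the sensitivity is only 0.20, which is the pattern of low sensitivity and high specificity described above.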

Abstract

The rapid growth of publicly available online datasets has created new opportunities for machine learning research; however, these datasets are often sampled from populations that are unknown to the researcher. As a result, the processes used for data collection, labelling, and sampling are frequently undocumented, increasing the risk of unrepresentative or biased samples. When training data fail to reflect the underlying population, model outputs may propagate or amplify existing biases. This paper investigates the presence of bias in online datasets drawn from different domains. The research aims to identify biases associated with the distribution of attributes that represent protected personal characteristics and to evaluate methods that mitigate these biases by improving data representativeness. Three datasets were selected for the study, each containing at least one protected variable and enabling binary classification. The protected variables examined were marital status, race, and gender, respectively. Model performance was assessed using accuracy, sensitivity, and specificity. To investigate and address bias, data quality and representativeness techniques were applied, together with bivariate statistical analysis to remove variables with no significant association with the class label. Initial results showed that accuracy alone provided a misleading picture of model performance, particularly when sensitivity was low and specificity was high. After the data representativeness and mitigation techniques were applied, the models achieved more balanced performance across all three metrics, despite slight reductions in accuracy and specificity. The findings highlight that accuracy alone is insufficient as a performance metric. When datasets are not representative, bias mitigation methods that balance sensitivity and specificity may reduce accuracy but lead to more equitable outcomes. Classification models perform more reliably when class distributions are balanced, and fair systems should ensure equitable accuracy, sensitivity, and specificity across all protected subgroups.
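
One way to check whether accuracy, sensitivity, and specificity are equitable across protected subgroups is to compute them per group and compare the spread. The sketch below assumes this per-group comparison as the check; the abstract does not specify how the study measured it. The group labels "A" and "B", the data, and the function names are all hypothetical.

```python
from collections import defaultdict


def rates(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def metrics_by_group(y_true, y_pred, groups):
    """Split labels and predictions by protected-group value and score each group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: rates(ts, ps) for g, (ts, ps) in buckets.items()}


def max_gap(per_group, metric):
    """Largest difference in one metric between any two groups."""
    values = [m[metric] for m in per_group.values()]
    return max(values) - min(values)


# Hypothetical data: 8 records for group "A", then 8 records for group "B".
groups = ["A"] * 8 + ["B"] * 8
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1] + [1, 0, 0, 0, 0, 0, 0, 0]

per_group = metrics_by_group(y_true, y_pred, groups)
for group, scores in per_group.items():
    print(group, scores)
print("sensitivity gap:", max_gap(per_group, "sensitivity"))
```

In this made-up example the two groups have similar accuracy (0.75 and 0.625) but very different sensitivity (0.75 and 0.25), so a gap that looks small on accuracy is much larger on sensitivity.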
