Abstract

Misinformation and disinformation spread rapidly on social media, threatening public discourse, democratic processes, and social cohesion. One promising strategy for addressing these challenges is to evaluate the trustworthiness of entire domains (source websites) as a proxy for content credibility. This approach demands methods that are both scalable and data-driven. However, current solutions such as NewsGuard and Media Bias/Fact Check (MBFC) rely on expert assessments, cover only a limited number of domains, and in some cases (e.g., NewsGuard) require paid subscriptions. These constraints limit their usefulness for large-scale research. This study introduces a machine-learning-based system for assessing the quality and trustworthiness of websites. We propose a data-driven approach that uses a large dataset of expert-rated domains to predict credibility scores for previously unseen domains from domain-level features. Our supervised regression model achieves moderate performance on test data and high performance on independent datasets, indicating that it generalizes to unseen domains. In a feature importance analysis, we find that PageRank-based features provide the greatest reduction in prediction error, suggesting that link-based indicators play a central role in domain trustworthiness. The solution’s scalable design accommodates the continuously evolving nature of online content, so that evaluations can remain timely and relevant. The framework enables continuous assessment of thousands of domains with minimal manual effort. This capability allows stakeholders (social media platforms, media monitoring organizations, content moderators, and researchers) to allocate resources more efficiently, prioritize verification efforts, and reduce exposure to questionable sources.
Mohammadmosaferi et al. (Mon,) studied this question.
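To make the idea of ranking features by their "reduction in prediction error" concrete, the following is a minimal sketch, not the authors' pipeline. It assumes permutation importance as the importance measure (the abstract does not name the method), a plain linear regressor fitted by gradient descent as a stand-in for the actual model, and synthetic data in which a hypothetical `pagerank_like` column drives the score while an `uninformative` column does not. All names and values here are illustrative assumptions.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def fit_linear(X, y, lr=0.1, epochs=1000):
    """Fit y ~ w.x + b by batch gradient descent (stand-in for the real model)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Residuals of the current model on every training row.
        errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - t
                for row, t in zip(X, y)]
        for j in range(d):
            w[j] -= lr * 2 * mean(e * row[j] for e, row in zip(errs, X))
        b -= lr * 2 * mean(errs)
    return lambda row: sum(wj * xj for wj, xj in zip(w, row)) + b


def permutation_importance(predict, X, y, rng, repeats=10):
    """Average increase in MSE when a single feature column is shuffled."""
    base = mse(y, [predict(r) for r in X])
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)
            # Copy each row with column j replaced by its shuffled value.
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - base)
        scores.append(mean(increases))
    return scores


rng = random.Random(0)
# Synthetic data (assumption): column 0 mimics a PageRank-like feature,
# column 1 is an uninformative feature unrelated to the target.
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [0.8 * r[0] + rng.gauss(0, 0.05) for r in X]

split = 150  # train on the first 150 rows, evaluate on the held-out rest
model = fit_linear(X[:split], y[:split])
importance = permutation_importance(model, X[split:], y[split:], rng)
print(dict(zip(["pagerank_like", "uninformative"], importance)))
```

Shuffling a column breaks its relationship with the target while leaving its distribution unchanged, so a large increase in held-out error marks a feature the model relies on. Computing the scores on held-out rows rather than training rows keeps the ranking tied to generalization, which is the property the abstract emphasizes.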