• A perspectivist framework explicitly modeling annotator diversity in subjective tasks • Independent classifiers trained on demographic perspectives like gender, age, and ethnicity • A weighted ensemble strategy based on performance-driven weights and threshold optimization • Consistent performance improvements across EXIST 2024 and re-annotated SST-2 datasets • Qualitative analysis showing effective mitigation of subgroup-specific biases and errors Subjective linguistic tasks, such as sexism detection or sentiment analysis, often involve substantial disagreement among human annotators, reflecting genuine interpretive diversity rather than annotation noise. Traditional aggregation methods, most commonly majority voting, enforce a single reference label and an artificial consensus. This is problematic because it discards information about how different groups of people interpret the same content, thereby obscuring nuances that are crucial for understanding the phenomenon under study. This paper introduces a perspectivist framework that explicitly models annotator diversity by training independent classifiers based on demographic variables and subsequently combining them through a weighted ensembling strategy. Each perspective is assigned a relative importance according to its individual performance (F1-score), and the decision threshold is optimised to maximise the overall F1-score of the ensemble. Experiments conducted on three datasets—EXIST Texts 2024, EXIST Memes 2024, and a re-annotated version of SST-2—show consistent improvements across all tasks. The weighted ensemble achieves an F1-score of 0.84 on EXIST Texts, improves performance from 0.84 to 0.91 on EXIST Memes, and attains an F1-score of 0.95 on the re-annotated SST-2 dataset. These results demonstrate that weighted perspectivist ensembling achieves a better balance between precision and recall than both individual models and standard baselines, while preserving human interpretive diversity. They highlight the potential of perspectivist modelling as a pathway towards fairer and more robust NLP systems that are better aligned with human variability.
Ramos et al. (Sun,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: