What question did this study set out to answer?

This research aims to improve the handling of annotator diversity in subjective linguistic tasks through a novel weighted ensemble approach.

March 29, 2026 · Open Access

Capturing Subjectivity: A Weighted Ensemble Approach to Preserve Annotator Diversity

Key Points

Developed a perspectivist framework to model annotator diversity.
Trained independent classifiers based on demographic variables.
Implemented a weighted ensemble strategy optimized for overall F1-score.
Achieved an F1-score of 0.84 on the EXIST Texts dataset.
Improved performance from 0.84 to 0.91 on EXIST Memes through ensembling.
Attained an F1-score of 0.95 on the re-annotated SST-2 dataset.

Abstract

• A perspectivist framework explicitly modeling annotator diversity in subjective tasks
• Independent classifiers trained on demographic perspectives like gender, age, and ethnicity
• A weighted ensemble strategy based on performance-driven weights and threshold optimization
• Consistent performance improvements across EXIST 2024 and re-annotated SST-2 datasets
• Qualitative analysis showing effective mitigation of subgroup-specific biases and errors

Subjective linguistic tasks, such as sexism detection or sentiment analysis, often involve substantial disagreement among human annotators, reflecting genuine interpretive diversity rather than annotation noise. Traditional aggregation methods, most commonly majority voting, enforce a single reference label and an artificial consensus. This is problematic because it discards information about how different groups of people interpret the same content, thereby obscuring nuances that are crucial for understanding the phenomenon under study. This paper introduces a perspectivist framework that explicitly models annotator diversity by training independent classifiers based on demographic variables and subsequently combining them through a weighted ensembling strategy. Each perspective is assigned a relative importance according to its individual performance (F1-score), and the decision threshold is optimised to maximise the overall F1-score of the ensemble. Experiments conducted on three datasets (EXIST Texts 2024, EXIST Memes 2024, and a re-annotated version of SST-2) show consistent improvements across all tasks. The weighted ensemble achieves an F1-score of 0.84 on EXIST Texts, improves performance from 0.84 to 0.91 on EXIST Memes, and attains an F1-score of 0.95 on the re-annotated SST-2 dataset. These results demonstrate that weighted perspectivist ensembling achieves a better balance between precision and recall than both individual models and standard baselines, while preserving human interpretive diversity. They highlight the potential of perspectivist modelling as a pathway towards fairer and more robust NLP systems that are better aligned with human variability.
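
The weighting and threshold steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalisation of weights to sum to 1, the 0.5 base threshold used to score each perspective, and the grid search over thresholds are all assumptions made for illustration. It presumes each perspective classifier outputs a probability for the positive class on a shared validation set.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def perspective_weights(
    y_true: Sequence[int],
    probs: Dict[str, Sequence[float]],
    base_threshold: float = 0.5,  # assumption: each model is scored at 0.5
) -> Dict[str, float]:
    """Weight each perspective by its own F1, normalised to sum to 1."""
    scores = {
        name: f1_score(y_true, [int(p >= base_threshold) for p in ps])
        for name, ps in probs.items()
    }
    total = sum(scores.values())
    if total == 0:  # fall back to uniform weights if every model scores 0
        return {name: 1.0 / len(scores) for name in scores}
    return {name: s / total for name, s in scores.items()}


def ensemble_scores(
    probs: Dict[str, Sequence[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted average of the per-perspective probabilities."""
    n = len(next(iter(probs.values())))
    return [sum(weights[name] * probs[name][i] for name in probs) for i in range(n)]


def best_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    grid: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Grid-search the decision threshold that maximises ensemble F1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        f1 = f1_score(y_true, [int(s >= t) for s in scores])
        if f1 > best_f1:  # keep the lowest threshold among ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


if __name__ == "__main__":
    # Toy validation data; the perspective names are hypothetical.
    y_val = [1, 0, 1, 1, 0, 0]
    val_probs = {
        "gender": [0.9, 0.2, 0.6, 0.4, 0.3, 0.1],
        "age": [0.7, 0.6, 0.8, 0.7, 0.4, 0.2],
        "ethnicity": [0.8, 0.1, 0.3, 0.6, 0.7, 0.2],
    }
    w = perspective_weights(y_val, val_probs)
    t, f1 = best_threshold(y_val, ensemble_scores(val_probs, w))
    print(w, t, f1)
```

In practice the weights and threshold would be fitted on held-out validation data and then applied unchanged to the test set, so that the reported F1-score is not inflated by tuning on the evaluation data.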
