Current practices in the field of natural language processing may reinforce stereotypes, stigmatize non-normative speech, and prevent access to public discourse online. Across Latin America, the amount of digital material available in the region's minoritized (mostly Indigenous) languages has increased rapidly, and a number of Indigenous language corpora and language models are currently under development. Given that “poor data quality in critical areas can disproportionately impact vulnerable communities and situations,” it is important to examine the norms and assumptions embedded within the process for building linguistic datasets. Operationalizing and measuring harms have been the primary focus of work investigating bias in natural language processing, and linguistic justice has recently been proposed as a framework for identifying harmful language ideologies in natural language processing systems. This article explores whether and how harmful ideologies of language may be informing the work of natural language processing researchers focused on minoritized Mexican languages, through a systematic search and content analysis of published scholarship on natural language processing covering a 20-year period. The findings show that the field is changing rapidly, with far greater awareness of potentially harmful language ideologies in recent years and with attempts to mitigate the associated bias. This work also shows that the concepts of linguistic justice and language ideology provide a fruitful framework for understanding, and potentially guiding, the further integration of ethical protocols into the construction of language technologies.
Melissa Gasparotto studied this question.