November 1, 2024 | Open Access

Language is not a data set—Why overcoming ideologies of dataism is more important than ever in the age of AI

Abstract

Meaning, then, is derived not through content or data or even theory in a Western context, which by nature is decontextualized knowledge, but through a compassionate web of interdependent relationships that are different and valuable because of difference. (Simpson, 2014, p. 11)
Helen Kelly-Holmes’ call to explore the implications for sociolinguistics arising from the increased commercially driven digitalization of society is very timely. Like Kelly-Holmes, we share the view that the growing prevalence of online and artificial intelligence (AI) technologies in all aspects of our lives requires a critical assessment of assumptions, approaches, and practices that have grounded sociolinguistic research since its inception. While our discussion confirms Helen's observations, we also urge the development of a general critical attitude toward understanding language as digital data.
The starting point for our argument is Helen's claim that there is an erasure of “authentic” languages from public digital spaces, “making it more difficult to gather data on real usage because it would be necessary to rely on public areas and/or negotiate access to these private spaces” (p. 5). For us, her observation brings to the fore that treating language as data has always been problematic. We want to raise two issues: the general epistemological limitations of using digital user data as a representation of language and community, and the consequent need for methods that take seriously the study of language in its social, political, and technological context. We suggest ethnography as a method for understanding what speakers actually do, and an opening of language research to also consider the workings and socio-political embeddings of digital and generative AI language technologies. Our discussion is in the spirit of a joint fruitful and constructive debate.
Let us start with a general critique of approaching language as “data” that correlates with social groups, which is so far a neglected aspect in the debates surrounding language, sociolinguistics, and AI. Historically, this discussion links to the colonial backgrounds of Western science and linguistics specifically. Colonial or missionary linguistic research (e.g., Deumert [...] Errington, 2008) demonstrates that dominant Western epistemologies of language and research methods in linguistics were shaped during the period of European colonialism. An important legacy of European colonialism is that it “sought to fundamentally change and reorganize the social and economic order of the societies it colonized, as opposed to satisfy itself with extracting tribute” (Couldry [...] Gal [...] Gal [...] Silverstein, 2014).
People are social agents who pick linguistic practices based on their identity, on the goals that they want to (temporarily) foreground, and on their current understanding of an interaction based on the indexicalities that they perceive. In addition, humans “dynamically reshape the context that provides organization for their actions within the interaction itself” (Duranti [...] multimodal vs. plain text) may also become contextualization cues and their indexicalities are not constant as different contexts have different affordances in terms of devices, literacy, and ideologies of language and media (Gershon, 2010). Without ethnographic observation and a consideration of the social and technological contexts, local meanings of language and the social indexicality of language and technology choices can easily be misinterpreted. This also applies to the linguistic output of AI tools, which is typically edited by users, according to their audiences and language ideologies, the latter increasingly influenced by the ascription of authority to data and algorithms, but possibly also by their rejection.
The edited language feeds back into systems so that the whole AI arrangement becomes a complex socio-technical human–machine assemblage (Fester-Seeger et al., in preparation; Pennycook, 2024). In this, it is impossible to know what people do and why without engaging with people. The belief in objectified, decontextualized data as the sole source of knowledge creation has been problematic in the past and becomes even more so in an age of digital transnational interaction and AI interventions.
This also means that we need new conceptual tools, categories, and methodological approaches to study language in a society in which digital platforms, owned by a handful of American companies, make enormous profits with their data collection activities. They feed these into AI systems, which, in turn, impact language use, language ideologies, and the formation of communities worldwide. We concur with Kelly-Holmes (2023) that we therefore cannot neglect the macro level in our research and need to put a greater focus on investigating and critically exploring sociopolitical structures and systems of commercialization and technology, and how they impact on language practices, language ideologies, linguistic research, and language policies. Our call for engagement with language in a holistic manner is thus not only a call for ethnography. We have to deepen our understanding of how language technologies are built and why. Understanding the ideological underpinnings of the market activity of the tech sector is of particular importance to fully capture the processes in which language technologies are embedded. The actual workings and motivations of digital technologies have received the least attention in linguistic research despite their impact on language practices (see, however, e.g., Jones et al., 2015). Critical sociological research (Couldry [...] Zuboff, 2019) [...].
These critical insights can help inform a more nuanced understanding of the modus operandi of corporate language technologies, whose interests they serve and how they are monetized. In the overall context of changing socio-technological conditions of society, we must not forget the sociopolitical and economic context. The state has traditionally played a crucial role in the framing of sociolinguistic economies (Blommaert, 2010, p. 195). More recently, many governments of both the Global North and South appear to have adopted a techno-solutionist approach to AI, including language technologies. Public authorities have long delegated the development of digital technologies to the market, replacing government language technology policy with the strategies of the commercially driven private sector, in the belief that it would achieve social goods for all (Birhane, 2020; Morozov, 2013). This situation has resulted in an acute digital inequality among languages, where the technological readiness of populations (e.g., use of smartphones), the degree of language norming (e.g., uniform/roman scripts), the size of data sets, and/or the decision by companies to create artificial data sets (see, e.g., NLLB et al., 2022) impact on whether or not a language is provided with critical AI tools and thus becomes reified and visible in digital space. It has also created tension between the private sector of commercial providers of language technologies and the blurring role of public institutions as traditional regulators and exclusive holders of normative authority in language matters (Erdocia et al., under review). We have become utterly dependent on private technologies manufactured and controlled by a handful of opaque companies. Like the raw resource mining industries, they appear mostly indifferent to the social consequences of their activities and only invest minimally if obliged by government regulations to enhance their public image. 
It is expected that the state, also within supra-national organizations, regains a more active role as a guarantor of fundamental rights for users with regulatory and supervisory frameworks (see the EU's Digital Services Act). In the language field, this includes public–private partnerships to develop accurate, ethical, and unbiased data sets and technologies for all (particularly “low-resource”) languages in an attempt to reduce the technology gap between English and other languages (see “Language Equality in the Digital Age” resolution, European Parliament, 2018; Rehm [...] their social, political, financial, and linguistic dynamics; and their material affordances, practices, understandings, and the web of indexical relationships between them. Paying attention to the entire sociopolitical and technological structure that enables the existence and penetration into all spheres of life of AI (not just users' data, activities, and views) will confront us with our own disciplinary assumptions, biases of dataism, categories, practices, and colonial ideologies and can only enhance our work. Sociolinguistic findings and expertise are increasingly sought out by the tech industry to help fine-tune the functioning of AI technologies. Comprehensive engagement with the intertwined online and offline context will put us in a better position to engage with this interest in our work, understand the role of language data in contemporary socio-political contexts, and, more broadly, how our work can contribute to a critically engaged understanding of the sociolinguistics of AI.
The authors declare no conflicts of interest.

Authors

Iker Erdocia

Bettina Migge

Britta Schneider

Journals

Journal of Sociolinguistics

Institutions

University College Dublin

Dublin City University

European University Viadrina
