What does this research mean for the field?

Fine-tuned large language models outperform dictionary methods and dedicated language identification packages in automatically detecting code-switched language snippets in historical and literary texts. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This study aims to enhance automatic identification of code switching in texts by evaluating various methods.

May 28, 2026 | Open Access

Advancing methods in automatic code switching detection for digital humanities

Key Points

Evaluated three automatic methods: dictionary, dedicated language identification packages, and fine-tuned large language models.
Tested on manually tagged French snippets in English literary texts.
Published code and methodologies for further research and application.
Fine-tuned LLMs achieved the highest overall detection rates for code switching among the methods tested.
Identified ongoing methodological questions specific to the language pair.
Observations of French usage in English texts ranged from 1814 to 1920.

Abstract

There has been extensive research on the phenomenon of code switching, meaning the use of two or more languages or language varieties within texts. Until recently, most code switching studies in the digital humanities have tagged the mixed languages manually, as automatic language identification methods have so far performed too unreliably to be useful, although research on automatic methods is growing. This paper aims to improve methods for identifying snippets of a second language in historical and literary texts by evaluating three automatic methods of increasing complexity: dictionary method, dedicated language identification packages, and fine-tuned large language models (LLMs). We evaluate the methods on the test case of manually tagged French snippets in English literary texts, and report that fine-tuned LLMs performed with the highest overall detection rates in our experiment with different models, although language pair-specific methodological questions remain. We have published our code and fine-tuned LLMs to assist research on this language pair, and these methods may be expanded to more language pairs and broader applications in the future. Finally, we report observations of French usage in the English literary texts in our corpus, dating 1814–1920.
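To make the simplest of the three methods concrete, the following is a minimal sketch of what a dictionary-based detector for French snippets in English text might look like. It is not the authors' published code. The word lists, the function names `tag_tokens` and `french_snippets`, and the `min_len` threshold are all hypothetical and chosen only for illustration; a real run would load full dictionaries for both languages.

```python
import re

# Hypothetical toy word lists (assumption); real dictionaries would be far larger.
FRENCH_WORDS = {"je", "ne", "sais", "quoi", "mon", "cher", "ami", "au", "revoir", "a"}
ENGLISH_WORDS = {"a", "the", "she", "had", "certain", "said", "he", "left", "and"}

# Runs of Unicode letters only (no digits, underscores, or punctuation).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tag_tokens(text):
    """Label a token 'fr' if it is in the French list but not the English one, else 'en'."""
    tags = []
    for match in TOKEN_RE.finditer(text):
        word = match.group().lower()
        is_french = word in FRENCH_WORDS and word not in ENGLISH_WORDS
        tags.append((match.group(), "fr" if is_french else "en"))
    return tags


def french_snippets(text, min_len=2):
    """Group consecutive 'fr' tokens into snippets of at least min_len words."""
    snippets, current = [], []
    for token, tag in tag_tokens(text):
        if tag == "fr":
            current.append(token)
            continue
        if len(current) >= min_len:
            snippets.append(" ".join(current))
        current = []
    if len(current) >= min_len:
        snippets.append(" ".join(current))
    return snippets


print(french_snippets("She had a certain je ne sais quoi, he said."))
```

Words spelled identically in both languages (here "a") default to English in this sketch. Ambiguity of that kind is one plausible reason a pure dictionary lookup would struggle on this language pair.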
