Abstract
There has been extensive research on the phenomenon of code switching, the use of two or more languages or language varieties within a text. Until recently, most code switching studies in the digital humanities have tagged the mixed languages manually, because automatic language identification methods have performed too unreliably to be useful, although research on automatic methods is growing. This paper aims to improve methods for identifying snippets of a second language in historical and literary texts by evaluating three automatic methods of increasing complexity: a dictionary method, dedicated language identification packages, and fine-tuned large language models (LLMs). We evaluate the methods on the test case of manually tagged French snippets in English literary texts, and report that fine-tuned LLMs achieved the highest overall detection rates in our experiments with different models, although language pair-specific methodological questions remain. We have published our code and fine-tuned LLMs to assist research on this language pair, and these methods may be extended to more language pairs and broader applications in the future. Finally, we report observations of French usage in the English literary texts in our corpus, which date from 1814 to 1920.
Ketzan et al. (Thu,) studied this question.