In this Research Note, we describe processes and protocols for curating a corpus of texts, from data collection to readiness for analysis. We offer example case studies that work with texts in different genres and languages, using different tools, to illustrate general principles of corpus curation. Rather than a comprehensive guide for researchers interested in corpus linguistics methods, we offer a conversational starting point, supplementing our overview of three phases (collection, cleaning, and pre-processing) with authentic experiences from our own diverse research. We also reflect on the pedagogical implications of corpus linguistics, and on the challenges and next steps for corpus curation and analysis in the age of generative AI. Our experiences show how common curation phases can be applied to different studies and contexts, and which considerations arise when doing so.
Alsop et al. studied this question.