What question did this study set out to answer?

To establish a model for curating a corpus of texts for linguistic analysis.

March 12, 2026 | Open Access

Curating a Corpus: A Three-Phase Model

Key Points

Aims to establish a model for curating a corpus of texts for linguistic analysis.
Describes a three-phase model: collection, cleaning, and pre-processing.
Illustrates the model with case studies from various genres and languages.
Discusses tools used in corpus curation and associated pedagogical implications.
Demonstrates how common curation phases apply across diverse studies.
Identifies challenges and next steps in corpus curation in the context of generative AI.

Abstract

We describe in this Research Note processes and protocols for curating a corpus of texts for analysis, from collection of data to readiness for analysis. We offer example case studies working with texts in different genres and languages, and using different tools, to illustrate general principles for corpus curation. Rather than a comprehensive guide for researchers interested in corpus linguistics methods, we offer a conversational starting point, supplementing our overview of three phases (collection, cleaning, and pre-processing) with authentic experiences from our own diverse research. Further, we reflect on the pedagogical implications associated with corpus linguistics, as well as the challenges and next steps in corpus curation and analysis in the age of generative AI. Our experiences show how common curation phases can be applied to different studies and contexts, and the considerations that arise when doing so.
