Far too often, argues Ryan Cordell, "the computer" has been "treated as a window to the physical archive rather than as an integrated remediation of the archive." He implores scholars to "reckon with mass digitized historical texts as new and discrete bibliographic objects" (190). But while curated archives mediate the histories they represent, they nevertheless play a necessary role in connecting end users (be they researchers, librarians, or the public) with primary materials (Blouin 102-103). Such acts of mediation have become all the more fraught in the context of the digital humanities, as archivists and scholars use archival holdings not only to access materials, but also to prepare and analyze them for exhibition. Cordell's call to action, for us to "take the digitized text seriously within its own medium" (217), foregrounds how due excitement over material made available through mass digitization must be tempered by an acknowledgment of the practical limitations of exhibiting material from digital collections. These limits are apparent not only in the application of computer-mediated analyses to questions of traditionally humanist inquiry, as Nan Z. Da argues, but also in the early stages of corpus creation.
Nowhere is the potential for reduction more relevant than in the context of historical documents, for which curated research outputs such as exhibitions remain, for many end users, their only form of interaction with archival materials. Optical Character Recognition (OCR), the computer-assisted method of deriving text from image files, is a critical step in the many levels of mediation between a primary source and its appearance as a digital object. OCR creates a new layer of machine-readable text, a format of structured data that can be read by a computer, which lies atop the primary source text contained within image files. In the context of corpus creation and, later, exhibition, researchers add further layers of mediation when extracting and transforming data from the digital object. It is these layers, and specifically how the limitations posed by OCR outputs impact corpus collection, with which we are primarily concerned.
This study seeks to outline the hurdles, benefits, and impacts of archival analysis at scale by comparing two case studies, each with a different approach to corpus creation and exhibition. The first project, Food Riddles and Riddling Ways (the Riddle Project), follows a top-down approach using search strings of relevant keywords to aggregate data from existing primary source databases. The second project, Ciphers of "The Times," uses a bottom-up approach that focuses exclusively on one digital collection to create a machine-readable corpus for syntax-level computational analysis. While the two approaches create datasets from similar source material, they introduce mediation from opposite directions: the top-down approach by narrowing an existing dataset and the bottom-up approach by constructing a corpus through acts of transcription. We identify the information-seeking behaviours directing each method and how they negotiate the uncertainties of compiling imperfect OCR data from historical collections. In both cases, we understand OCR not as a passive interlocutor but rather as an invisible curator in its own right, revealing and obscuring data with substantial impact on curated outputs.
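How OCR can act as an "invisible curator" for a keyword-driven, top-down search can be illustrated with a small sketch. This is not either project's actual pipeline, which is not described here. The sample lines, the function names `exact_hits` and `fuzzy_hits`, and the 0.8 similarity threshold are all assumptions invented for illustration. The sketch compares a literal keyword search with a tolerant one over text containing common OCR confusions (such as "l" read as "1").

```python
from difflib import SequenceMatcher

def exact_hits(lines, keyword):
    """Indices of lines that contain the keyword exactly (case-insensitive)."""
    kw = keyword.lower()
    return [i for i, line in enumerate(lines) if kw in line.lower()]

def fuzzy_hits(lines, keyword, threshold=0.8):
    """Indices of lines where at least one token is similar to the keyword.

    The threshold is an illustrative assumption, not a recommended value.
    """
    kw = keyword.lower()
    hits = []
    for i, line in enumerate(lines):
        tokens = line.lower().split()
        if any(SequenceMatcher(None, kw, tok).ratio() >= threshold for tok in tokens):
            hits.append(i)
    return hits

# Hypothetical OCR output, invented for this example.
sample = [
    "a riddle for the dinner table",  # clean transcription
    "a ridd1e for the dinner tab1e",  # "l" misread as "1"
    "the riddlc of the pudding",      # "e" misread as "c"
    "prices of corn and wheat",       # unrelated line
]

exact = exact_hits(sample, "riddle")
fuzzy = fuzzy_hits(sample, "riddle")
print("exact:", exact)
print("fuzzy:", fuzzy)
```

In this toy sample the literal search returns only the clean line, so two relevant lines never enter the corpus. The tolerant search recovers them, though a looser threshold would also begin to admit unrelated words.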
Jacquelyn Sundberg
Ronny Litvack-Katzman
McGill University
Nathalie Cooke
McGill University
Sundberg et al. (Wed,) studied this question.
synapsesocial.com/papers/68e67f58b6db643587608443 — DOI: https://doi.org/10.21428/f1f23564.82eed51a