Abstract

This study presents a workflow for transforming unstructured historical civil records into a structured database. We used the handwritten text recognition (HTR) of the Transkribus software to transcribe the records and explored both regular expressions and large language models (LLMs) to extract entities from the transcribed text. We worked with a sample of 24,000 death certificates from the Dutch Caribbean island of Curaçao (1879–1949) to explore whether these digital methods can be applied outside the Western context in which they were developed. The article shows the importance of scan quality and that investing in custom-trained models early in the pipeline pays off in later steps. We also argue that quality indicators computed over the entire text are of limited use to scholars, who are often interested only in specific entities: names and professions in particular can contain errors even when the rest of the transcribed text seems flawless. Finally, we found that large language models such as GPT outperform regular expressions. Still, we suggest incorporating citizen scientists into the workflow to extract or check specific entities in order to achieve the best possible dataset.
Hoek et al. studied this question.
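To make two of the abstract's points concrete, the following is a minimal sketch, not the authors' implementation. It shows regular-expression entity extraction and the way an error rate computed over the whole text can hide errors concentrated in a name. Everything in it is an assumption made for illustration: the sample sentences are invented and written in English rather than taken from the Curaçao certificates, and the pattern, the function names, and the character error rate (CER) helper are hypothetical.

```python
import re

# Invented reference transcription (assumption: not a real certificate).
GROUND_TRUTH = (
    "On 3 March 1901 died Maria Josefina Martijn, aged 42 years, "
    "profession: seamstress."
)
# The same sentence as a recognition model might return it,
# with errors only inside the name and the profession.
HTR_OUTPUT = (
    "On 3 March 1901 died Maria Josefina Martyn, aged 42 years, "
    "profession: seamstres."
)

# Hypothetical pattern for this one fixed sentence template.
PATTERN = re.compile(
    r"died (?P<name>[^,]+), aged (?P<age>\d+) years, "
    r"profession: (?P<profession>[^.]+)\."
)


def extract_entities(text):
    """Return the named groups as a dict, or an empty dict if nothing matches."""
    match = PATTERN.search(text)
    return match.groupdict() if match else {}


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = extract_entities(GROUND_TRUTH)
    found = extract_entities(HTR_OUTPUT)
    print("extracted:", found)
    print("CER, whole text:", round(cer(GROUND_TRUTH, HTR_OUTPUT), 3))
    print("CER, name only: ", round(cer(truth["name"], found["name"]), 3))
```

In this toy case the error rate for the name alone is higher than for the full sentence, which is the kind of gap a text-level indicator would not reveal. A pattern like this only works when the transcription follows the expected template exactly; any deviation in wording or punctuation makes the match fail, which is one plausible reason a more flexible extractor can do better.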