What question did this study set out to answer?

To transform unstructured historical civil records into a structured database for better accessibility and research utility.

May 6, 2026 | Open Access

Defeating the Haystack: lessons learned from extracting entities from Dutch Curaçao civil certificates, 1879–1949

Key Points

Used the handwritten text recognition of the Transkribus software for transcription
Applied regular expressions and large language models for entity extraction
Analyzed a sample of 24,000 death certificates from 1879–1949 in Curaçao
Scan quality impacts transcription accuracy significantly
Investing in custom-trained models early in the pipeline pays off in later extraction steps
Errors in specific entities such as names and professions persist even when the rest of the transcribed text seems flawless
Large language models, such as GPT, outperform traditional regular expressions
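
As a rough illustration of the regular-expression approach named in the key points, the sketch below pulls a name, age, and profession out of one transcribed sentence. The sample sentence, the pattern, and the function name `extract_entities` are all invented for this example and are not taken from the study; real certificates are longer and noisier.

```python
import re

# Hypothetical, simplified transcription fragment (invented for illustration).
SAMPLE = (
    "dat op den negenden Maart is overleden Maria Martina, "
    "oud veertig jaren, van beroep naaister, wonende op Curaçao"
)

# Assumed fixed phrasing: the name follows "is overleden", the age follows
# "oud", and the profession follows "van beroep". This is an assumption about
# the formula, not the pattern used by the authors.
PATTERN = re.compile(
    r"is overleden (?P<name>[^,]+), "
    r"oud (?P<age>[^,]+?) jaren, "
    r"van beroep (?P<profession>[^,]+)"
)


def extract_entities(text):
    """Return a dict of named entities, or None if the formula is not found."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


clean = extract_entities(SAMPLE)

# A single recognition error in the anchor phrase (capital I instead of l)
# is enough to make the pattern miss the whole record.
noisy = extract_entities(SAMPLE.replace("overleden", "overIeden"))
```

Here `clean` holds the three entities, while `noisy` is `None`. That brittleness is one plausible reason a more tolerant method could do better on imperfect transcriptions.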

Abstract

This study discusses a workflow for transforming unstructured historical civil records into a structured database. We used the handwritten text recognition of the Transkribus software to transcribe the records and explored both regular expressions and large language models to extract entities from the transcribed text. We worked with a sample of 24,000 death certificates from the Dutch Caribbean island of Curaçao, 1879–1949, to explore whether these digital methods could be applied outside the Western context in which they were developed. The article shows the importance of scan quality. Furthermore, investing in custom-trained models early in the pipeline pays off in later steps. We also raise the point that quality indicators for the entire text are not particularly helpful for scholars, who are often interested only in specific entities: names and professions in particular contain errors even when the rest of the transcribed text seems flawless. Finally, we found that large language models such as GPT outperform regular expressions. Still, we suggest incorporating citizen scientists in the workflow to extract or check specific entities to achieve the best possible dataset.
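
The point about whole-text quality indicators can be made concrete with a small sketch. It assumes character error rate (CER) as the whole-text indicator, which the abstract does not name, and it uses an invented reference and transcription pair. The functions `levenshtein` and `cer` are written here for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: the transcription differs from the reference by one
# character, and that character happens to sit inside the name.
reference = "is overleden Maria Martina, oud veertig jaren, van beroep naaister"
hypothesis = "is overleden Maria Martins, oud veertig jaren, van beroep naaister"

text_cer = cer(reference, hypothesis)
name_correct = "Maria Martina" in hypothesis
```

In this constructed case `text_cer` is low, yet `name_correct` is `False`: the one entity a researcher would link on is wrong. An entity-level check of this kind is one way to measure what the whole-text figure hides.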
