June 15, 2026 | Open Access

Scalable information extraction with large language models

Key Points

This research aims to advance scalable information extraction using large language models, particularly in scientific domains.
Introduced a full-text benchmark with 106 annotated papers for scientific information extraction.
Developed DynClean, a framework for cleaning labels in distantly supervised named entity recognition.
Conducted a study on many-shot in-context learning to enhance named entity recognition performance.
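The benchmark supports extraction of datasets, methods, tasks, and their relations from full-text scientific publications, providing a more realistic testbed than resources limited to abstracts or selected paragraphs.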
DynClean improved downstream F1 by 3.19% to 8.95% across four benchmark datasets and outperformed prior distantly supervised NER methods by up to 4.53 F1 points.
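With around 100 human-labeled examples, LLMs generated labeled corpora for training smaller models, yielding improvements of approximately 10 absolute F1 points over strong baselines in low-resource, domain-specific named entity recognition.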

Abstract

Information extraction (IE) transforms unstructured text into structured knowledge such as entities and relations, and is fundamental to applications including knowledge graph construction, information retrieval, question answering, and domain-specific document understanding. Although large language models (LLMs) have broadened the scope of IE through zero-shot and in-context extraction, scalable IE remains challenging in realistic settings, particularly for scientific and other specialized domains where labeled data is scarce, expensive, and difficult to curate. This dissertation studies scalable information extraction with large language models from a data-centric perspective, arguing that progress requires advances in benchmark construction, noise-aware supervision, and LLM-based annotation. This dissertation makes three main contributions. First, it introduces a full-text benchmark for scientific information extraction that supports the extraction of datasets, methods, tasks, and their relations from scientific publications. The benchmark contains 106 manually annotated full-text papers with more than 24,000 entity mentions and 12,000 relations, providing a more realistic testbed than prior resources limited to abstracts or selected paragraphs. Second, it proposes DynClean, a training dynamics-based label cleaning framework for distantly supervised named entity recognition. By locating false positive and false negative annotations in weakly supervised data, DynClean improves downstream F1 by 3.19% to 8.95% across four benchmark datasets and outperforms prior distantly supervised NER methods by up to 4.53 F1 points. Third, it presents a systematic study of many-shot in-context learning for named entity recognition and develops an in-context annotation framework for low-resource settings. The results show that using around 100 human-labeled examples, LLMs can generate high-quality labeled corpora for training smaller models, yielding improvements of approximately 10 absolute F1 points over strong baselines in low-resource domain-specific NER. Taken together, these contributions support a unified view of scalable information extraction in which realistic benchmarks define meaningful tasks, noise-aware supervision improves the quality of automatically generated labels, and LLMs act as annotation engines that amplify limited human supervision. More broadly, this dissertation shows that scalable IE is not only a modeling problem, but also a data-centric systems problem requiring the joint design of resources, supervision, and adaptation mechanisms.
