Information extraction (IE) transforms unstructured text into structured knowledge such as entities and relations, and is fundamental to applications including knowledge graph construction, information retrieval, question answering, and domain-specific document understanding. Although large language models (LLMs) have broadened the scope of IE through zero-shot and in-context extraction, scalable IE remains challenging in realistic settings, particularly in scientific and other specialized domains where labeled data is scarce, expensive, and difficult to curate. This dissertation studies scalable information extraction with LLMs from a data-centric perspective, arguing that progress requires advances in benchmark construction, noise-aware supervision, and LLM-based annotation.
The dissertation makes three main contributions. First, it introduces a full-text benchmark for scientific information extraction that supports the extraction of datasets, methods, tasks, and their relations from scientific publications. The benchmark contains 106 manually annotated full-text papers with more than 24,000 entity mentions and 12,000 relations, providing a more realistic testbed than prior resources limited to abstracts or selected paragraphs. Second, it proposes DynClean, a label cleaning framework based on training dynamics for distantly supervised named entity recognition (NER). By locating false positive and false negative annotations in weakly supervised data, DynClean improves downstream F1 by 3.19% to 8.95% across four benchmark datasets and outperforms prior distantly supervised NER methods by up to 4.53 F1 points. Third, it presents a systematic study of many-shot in-context learning for NER and develops an in-context annotation framework for low-resource settings. The results show that, given around 100 human-labeled examples, LLMs can generate high-quality labeled corpora for training smaller models, yielding improvements of approximately 10 absolute F1 points over strong baselines in low-resource domain-specific NER.
Taken together, these contributions support a unified view of scalable information extraction in which realistic benchmarks define meaningful tasks, noise-aware supervision improves the quality of automatically generated labels, and LLMs act as annotation engines that amplify limited human supervision. More broadly, this dissertation shows that scalable IE is not only a modeling problem but also a data-centric systems problem requiring the joint design of resources, supervision, and adaptation mechanisms.
Qi Zhang studied this question.