What question did this study set out to answer?

The aim is to develop a method for efficiently extracting and structuring knowledge from LLM research papers.

April 17, 2026 | Open Access

Automated Knowledge Extraction from Large Language Model Research Papers for the ORKG Model Landscape

Key Points

The aim is to develop a method for efficiently extracting and structuring knowledge from LLM research papers.
Developed an NLP workflow for parsing research papers from sources like arXiv.
Applied LLM-based extraction to structure data into the ORKG format.
Evaluated 18 LLMs with property-level precision, recall, and F1 under strict and fuzzy matching.
Utilized BERTScore for evaluating outputs in longer text fields.
Characterized extraction quality across various LLM types and identified common issues.
Results revealed challenges in extracting numerically dense and context-dependent properties.
Yielded a reproducible pipeline for curating machine-actionable LLM knowledge.

Abstract

Scientific publishing remains document-centric: much knowledge is embedded in natural language across PDFs, posts, and repositories, which limits machine-assisted discovery and reuse. In generative AI, frequent releases of Large Language Models (LLMs) scatter core facts (architecture, training, parameters, licensing, and applications) across heterogeneous sources, so maintaining a stable, queryable model catalog is difficult. Knowledge Graph-based infrastructures such as the Open Research Knowledge Graph (ORKG) address this gap through structured, machine-actionable descriptions aligned with FAIR principles. This thesis presents an NLP workflow that parses research papers (e.g., from arXiv), applies LLM-based extraction under the ORKG LLM template, and maps outputs into the “Generative AI Model Landscape” comparison, including support for multivariant papers. In total, 18 LLMs are evaluated: 10 models following the supervisor’s taxonomy (one vision-language model, three thinking or reasoning-focused models, and six instruction-tuned models spanning small to very large scales), plus eight additional compact open models (1B–8B parameters) run locally for comparison. Evaluation uses property-level precision, recall, and F1 with strict and fuzzy matching, complemented by BERTScore on longer fields. The results characterize extraction quality across model types and highlight recurring failure modes for numerically dense and context-dependent properties. The work yields a reproducible end-to-end pipeline and supports curating machine-actionable LLM knowledge in the ORKG.
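To make the evaluation described in the abstract concrete, the following is a minimal sketch of property-level precision, recall, and F1 with a strict and a fuzzy matching mode. It is not the thesis's implementation: the function names, the normalization step, the 0.8 threshold, and the example property values are all illustrative assumptions, and Python's standard `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the study actually used.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(value.lower().split())


def is_match(predicted: str, gold: str, fuzzy_threshold: Optional[float] = None) -> bool:
    """Strict mode: normalized strings must be equal.
    Fuzzy mode: similarity ratio must reach the (hypothetical) threshold."""
    p, g = normalize(predicted), normalize(gold)
    if fuzzy_threshold is None:
        return p == g
    return SequenceMatcher(None, p, g).ratio() >= fuzzy_threshold


def property_scores(
    predicted: Dict[str, str],
    gold: Dict[str, str],
    fuzzy_threshold: Optional[float] = None,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over extracted properties.

    A predicted property is a true positive when the same property exists
    in the gold annotation and its value matches under the chosen mode.
    """
    true_positives = sum(
        1
        for name, value in predicted.items()
        if name in gold and is_match(value, gold[name], fuzzy_threshold)
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical gold annotation and model output for a single paper.
gold = {"model name": "ExampleLM", "parameters": "7B", "license": "Apache 2.0"}
predicted = {"model name": "examplelm", "parameters": "7 billion", "license": "Apache-2.0"}

strict = property_scores(predicted, gold)
fuzzy = property_scores(predicted, gold, fuzzy_threshold=0.8)
print("strict:", strict)
print("fuzzy: ", fuzzy)
```

In this toy example, strict matching accepts only the model name, while fuzzy matching also accepts the license string that differs by a hyphen. The parameter count ("7 billion" versus "7B") fails in both modes, which illustrates why numerically dense properties are hard to score with string similarity alone.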
