What type of study is this?

This is a quantitative study.

September 20, 2025 | Open Access

Leveraging LLMs for Automated Extraction and Structuring of Educational Concepts and Relationships

Key points

LLMs have the potential to automate the extraction of educational concepts, improving the efficiency of course recommendations.
GPT-3.5 recorded the highest scores in quantitative metrics, but GPT-4o models produced more meaningful concepts overall.
Performance was assessed through automated experiments and human evaluations, illustrating how prompt design affects results.
Despite promising outcomes, LLM outputs still require expert revisions, indicating a need for careful implementation.

Abstract

Students in many online learning environments must navigate large course catalogs and make appropriate enrollment decisions. In this context, identifying key concepts and their relationships is essential for understanding course content and informing course recommendations. However, identifying and extracting concepts manually is extremely labor-intensive and time-consuming. Traditional NLP-based methods for extracting relevant concepts from courses rely heavily on resource-intensive preparation of detailed course materials, and so fail to minimize labor. Because recent advances in large language models (LLMs) offer a promising alternative for automating concept identification and relationship inference, we investigate the potential of LLMs to automatically generate course concepts and their relations. Specifically, we evaluate three LLM variants (GPT-3.5, GPT-4o-mini, and GPT-4o) across three educational tasks (concept generation, concept extraction, and relation identification) using six prompt configurations that range from minimal context (course title only) to rich context (course description, seed concepts, and subtitles). We assess model performance through automated experiments using standard metrics (Precision, Recall, F1, and Accuracy) and human evaluation by four domain experts, analyzing how prompt design and model choice influence the quality and reliability of the generated concepts and their interrelations. Our results show that GPT-3.5 achieves the highest scores on quantitative metrics, whereas GPT-4o and GPT-4o-mini often generate concepts that are more educationally meaningful despite lexical divergence from the ground truth. Nevertheless, LLM outputs still require expert revision, and performance is sensitive to prompt complexity. Overall, our experiments demonstrate the viability of LLMs as a tool for supporting educational content selection and delivery.
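
To make the automated metrics concrete, the following is a minimal sketch of how Precision, Recall, and F1 could be computed for a set of generated concepts against a ground-truth set. The abstract does not state the matching rule, so exact matching after lowercasing and whitespace normalization is an assumption here; the function name `concept_prf` and the example concepts are hypothetical and not taken from the paper.

```python
def concept_prf(predicted, reference):
    """Set-based precision, recall, and F1 using exact string match.

    Assumption: concepts match only if they are identical after
    lowercasing and collapsing whitespace.
    """
    def normalize(concept):
        return " ".join(concept.lower().split())

    pred = {normalize(c) for c in predicted}
    ref = {normalize(c) for c in reference}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: "model generalization" may be a sensible concept,
# but it counts as an error because it does not appear in the reference set.
reference = ["gradient descent", "overfitting", "regularization", "cross-validation"]
predicted = ["Gradient Descent", "overfitting", "model generalization"]

precision, recall, f1 = concept_prf(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under an exact-match rule like this one, a concept that is educationally reasonable but worded differently from the ground truth lowers the score, which is one way lexical divergence of the kind the abstract describes could depress quantitative metrics.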
