This study explores how effectively large language models (LLMs) summarize transcripts of instructional videos, a key application in educational technology. We assessed nine LLMs using two prompts (a simple base prompt and an enhanced, structured prompt) across 62 instructional videos. Two evaluating models, gpt-4o-mini and gemini-1.5-flash, scored the summaries on seven criteria tailored to instructional content: overall structure, presence of examples, availability of sources, relevance, coherence, narration, and accuracy. Results showed notable performance differences, with models such as Mistral Large and Claude 3.5 Sonnet performing best, especially with the enhanced prompt. However, in some cases the enhanced prompt improved narrative quality at the expense of structural clarity. Evaluator bias was also observed: gpt-4o-mini assigned higher scores than gemini-1.5-flash, which highlights the need for multiple evaluators. These findings underscore the role of prompt design and model choice in educational LLM applications and suggest future research into optimizing prompts and standardizing evaluation methods.
Chomątek et al. (Mon,) studied this question.
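The evaluator-bias observation in the abstract can be illustrated with a minimal sketch of how a score gap between two evaluators might be measured. This is not the authors' pipeline: the function names, the data layout (one dictionary of per-criterion scores per summary), the unweighted averaging, and the numeric values in the usage lines are all assumptions made for illustration. Only the seven criterion names come from the abstract.

```python
from statistics import mean

# The seven criteria named in the abstract.
CRITERIA = [
    "overall structure",
    "presence of examples",
    "availability of sources",
    "relevance",
    "coherence",
    "narration",
    "accuracy",
]


def summary_score(criterion_scores):
    """Average the per-criterion scores for one summary.

    Assumption: criteria are weighted equally; the paper may aggregate differently.
    """
    missing = [c for c in CRITERIA if c not in criterion_scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return mean(criterion_scores[c] for c in CRITERIA)


def evaluator_gap(scores_a, scores_b):
    """Mean difference (A minus B) over summaries scored by both evaluators.

    A positive value means evaluator A tends to score higher than evaluator B.
    """
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        raise ValueError("the evaluators share no summaries")
    return mean(
        summary_score(scores_a[key]) - summary_score(scores_b[key])
        for key in shared
    )


# Hypothetical placeholder values, not results from the study.
evaluator_a = {"video-1": {c: 4 for c in CRITERIA}}
evaluator_b = {"video-1": {c: 3 for c in CRITERIA}}
gap = evaluator_gap(evaluator_a, evaluator_b)
```

A consistently positive or negative gap across many summaries would be one simple signal of the kind of evaluator bias the abstract describes, and a reason to report scores from more than one evaluator.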