June 8, 2026 · Open Access

How to Train Your Dragon: Evaluating Prompting and Fine-Tuning for GPT-Based Item Generation in L2 Listening Assessment

Key Points

The aim is to evaluate the effectiveness of prompting versus fine-tuning in generating language assessment items for L2 listening.
Conducted an experimental comparison of prompting and fine-tuning methods for item generation in L2 listening assessment.
Generated 40 tests with a total of 240 multiple-choice items across three prompting iterations and one fine-tuned iteration.
Assessed item quality using a hybrid framework of rule-based metrics and human review to ensure content validity.
Fine-tuning yielded items that were more contextually grounded and linguistically coherent than those generated through prompting alone.
Fine-tuned items performed comparably to expert-generated items on passage dependence but were weaker at avoiding absolute language and at targeting localized spans of necessary information.
Identified persistent issues in LLM-generated items, such as longest-correct-option bias and uneven key distribution, which point to limitations inherent in LLM-based generation.

Abstract

Recent advances in large language models (LLMs), notably GPT models, have introduced new possibilities for automatic item generation (AIG) through prompt engineering. Although iterative prompt refinement can yield measurable improvements, this process eventually plateaus, as outputs remain inconsistent or misaligned with assessment constructs. This study reports an experimental comparison of prompting and fine-tuning to advance AIG for L2 listening assessment. First, we employed prompting and refined the instruction design over three successive iterations to determine an optimized prompt. We then fine-tuned GPT-4.1 using the same optimized prompt, holding prompt design constant to isolate the effect of model adaptation. We generated a total of 40 tests and 240 multiple-choice items in the four model conditions: three prompt-only iterations and one fine-tuned iteration. To contextualize model performance against professional assessment standards, we also compared the GPT-generated items with 245 expert-generated listening items used as the fine-tuning dataset. The generated items were evaluated using a hybrid two-tiered framework that combined rule-based metrics with human review to assess content validity and scalability. Fine-tuning produced stronger outcomes overall, yielding items that were more contextually grounded, linguistically coherent, and balanced than those generated through prompting alone, with some requiring minimal or no revision; however, generating higher-order items involving discourse-level reasoning remained challenging. The expert comparison showed that fine-tuned items performed comparably to expert-authored items on passage dependence but remained relatively weaker in avoiding absolute language and in targeting localized spans of necessary information. Additionally, issues such as longest-correct-option bias and uneven key distribution persisted, indicating limitations inherent in LLM-generated items. 
These findings demonstrate the value of fine-tuning for improving item quality and stability, while underscoring the continued need for multidimensional evaluation frameworks and expert benchmarking to ensure construct-aligned, valid assessment design.
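
As an illustration of the kind of rule-based check the abstract mentions, the sketch below computes two simple indicators for a set of multiple-choice items: how often the correct option is the longest one, and how the keys are spread across option positions. This is a minimal sketch under stated assumptions, not the study's actual metric implementation: the `Item` structure, the function names, the use of character length, and the sample items are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice item (hypothetical structure, not from the study)."""
    options: list[str]  # answer options in presentation order
    key_index: int      # position of the correct option


def longest_correct_rate(items: list[Item]) -> float:
    """Share of items whose key is strictly longer (in characters) than every distractor."""
    if not items:
        return 0.0
    hits = 0
    for item in items:
        key_len = len(item.options[item.key_index])
        distractor_lens = [
            len(opt) for i, opt in enumerate(item.options) if i != item.key_index
        ]
        if all(key_len > d for d in distractor_lens):
            hits += 1
    return hits / len(items)


def key_distribution(items: list[Item]) -> dict[int, float]:
    """Proportion of items keyed at each option position."""
    counts = Counter(item.key_index for item in items)
    total = len(items)
    return {pos: n / total for pos, n in sorted(counts.items())}


# Invented sample items, for illustration only.
sample = [
    Item(["At the library", "At the train station near the park", "At home"], 1),
    Item(["Monday", "Friday", "Sunday"], 0),
    Item(["He missed the bus", "He overslept", "He was ill"], 0),
    Item(["Tea", "Coffee", "Water"], 1),
]

print(longest_correct_rate(sample))
print(key_distribution(sample))
```

A high longest-correct rate, or a key distribution far from uniform, would flag an item set for human review; the thresholds for either are a design choice that this sketch leaves open.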