Recent advances in large language models (LLMs), notably GPT models, have introduced new possibilities for automatic item generation (AIG) through prompt engineering. Although iterative prompt refinement can yield measurable improvements, this process eventually plateaus, as outputs remain inconsistent or misaligned with assessment constructs. This study reports an experimental comparison of prompting and fine-tuning to advance AIG for L2 listening assessment. First, we employed prompting and refined the instruction design over three successive iterations to determine an optimized prompt. We then fine-tuned GPT-4.1 using the same optimized prompt, holding prompt design constant to isolate the effect of model adaptation. We generated a total of 40 tests and 240 multiple-choice items across four model conditions: three prompt-only iterations and one fine-tuned condition. To contextualize model performance against professional assessment standards, we also compared the GPT-generated items with the 245 expert-generated listening items used as the fine-tuning dataset. The generated items were evaluated using a hybrid two-tiered framework that combined rule-based metrics with human review to assess content validity and scalability. Fine-tuning produced stronger outcomes overall, yielding items that were more contextually grounded, linguistically coherent, and balanced than those generated through prompting alone, with some requiring minimal or no revision; however, generating higher-order items involving discourse-level reasoning remained challenging. The expert comparison showed that fine-tuned items performed comparably to expert-authored items on passage dependence but remained relatively weaker in avoiding absolute language and in targeting localized spans of necessary information. Additionally, issues such as longest-correct-option bias and uneven key distribution persisted, indicating limitations inherent in LLM-generated items.
These findings demonstrate the value of fine-tuning for improving item quality and stability, while underscoring the continued need for multidimensional evaluation frameworks and expert benchmarking to ensure construct-aligned, valid assessment design.
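To make the rule-based tier concrete, the following is a minimal sketch of how two of the issues named above, longest-correct-option bias and uneven key distribution, might be quantified. It is an illustration under stated assumptions, not the study's actual metric code: the item layout (a dict with `options` and `key`), the function names, and the use of word counts as the length measure are all hypothetical.

```python
from collections import Counter


def longest_key_rate(items):
    """Share of items whose keyed (correct) option is strictly longer,
    in words, than every distractor. Hypothetical metric definition."""
    if not items:
        return 0.0
    hits = 0
    for item in items:
        options = item["options"]  # assumed layout: {"A": "text", "B": "text", ...}
        key = item["key"]          # assumed layout: label of the correct option
        key_len = len(options[key].split())
        distractor_lens = [
            len(text.split()) for label, text in options.items() if label != key
        ]
        if all(key_len > n for n in distractor_lens):
            hits += 1
    return hits / len(items)


def key_distribution(items):
    """Proportion of items keyed to each option label that occurs as a key."""
    if not items:
        return {}
    counts = Counter(item["key"] for item in items)
    return {label: counts[label] / len(items) for label in sorted(counts)}
```

Under these assumptions, a longest-key rate well above what chance would predict for the number of options, or a key distribution far from uniform, would flag an item set for human review rather than settle its quality.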
Aryadoust et al. (Mon,) studied this question.