Fine-tuning Large Language Models (LLMs) for domains that demand extensive background knowledge, such as video game lore, involves trade-offs between model architecture, scale, and the organization of training data. In this study, we examine these factors by applying Quantized Low-Rank Adaptation (QLoRA) to the lore of Skyrim®. Our analysis covers nine models from the DeepSeek, Llama, and Qwen families at three parameter scales (~8B, ~13B, and ~33B). Each model was fine-tuned on unstructured, structured, and summarized datasets using LoRA ranks of 16, 32, and 64. Performance was evaluated with standard metrics (Perplexity, ROUGE, BLEU), a qualitative ensemble LLM-as-a-Judge framework, and a dedicated benchmark for catastrophic forgetting. The results show a consistent trade-off: structured datasets produce the most fluent outputs, while summarized datasets tend to improve factual accuracy, typically at the cost of faster degradation of general knowledge. Among the model families, Llama performs best at the ~8B and ~33B scales, whereas the code-specialized DeepSeek models have an edge at the ~13B scale. Furthermore, our analysis of training dynamics shows that higher LoRA ranks significantly improve convergence speed and stability. Overall, the best trade-off between performance, efficiency, and knowledge retention was achieved with Llama-3.1-8B fine-tuned on the summarized dataset with a LoRA rank of 64.
Monteiro et al. studied this question.
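
To make the size of the experimental design concrete, the following minimal sketch enumerates the factors named above and shows how the adapter size grows with LoRA rank. It rests on two assumptions that the text does not confirm: that the design is fully factorial (every family at every scale, on every dataset, at every rank), and that the adapted layer is a hypothetical square 4096 x 4096 projection. The names `lora_params`, `runs`, and `HIDDEN` are illustrative only. The parameter count uses the standard LoRA decomposition, in which a rank-r update to a d_out x d_in matrix adds r * (d_in + d_out) trainable parameters.

```python
from itertools import product

# Factors named in the abstract.
FAMILIES = ("DeepSeek", "Llama", "Qwen")
SCALES = ("~8B", "~13B", "~33B")
DATASETS = ("unstructured", "structured", "summarized")
RANKS = (16, 32, 64)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Assumption: a full factorial design (every family at every scale,
# on every dataset, at every rank). The text does not confirm this.
runs = list(product(FAMILIES, SCALES, DATASETS, RANKS))

# Assumption: a hypothetical square projection layer of width 4096.
HIDDEN = 4096
params_by_rank = {r: lora_params(HIDDEN, HIDDEN, r) for r in RANKS}

if __name__ == "__main__":
    print(f"{len(runs)} fine-tuning runs")
    for r, n in params_by_rank.items():
        print(f"rank {r}: {n:,} adapter parameters per layer")
```

Under these assumptions the adapter size per layer doubles with each rank step, so the reported gains in convergence speed and stability at higher ranks come with a proportionally larger number of trainable parameters.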