This paper presents a novel semi-automated approach to creating high-quality datasets for domain-specific large language model (LLM) fine-tuning through ontology-guided knowledge extraction. We address the sparsity of knowledge graphs (KGs) produced by traditional triplet extraction methods by developing a hierarchical ontology construction framework and applying it to procurement domain data. Our methodology begins with keyword-based, procurement-specific filtering of FineWeb data, which reduces the dataset size by 80%. We then use Llama-3.2-3B for data annotation, obtaining 3,000 positive and negative samples from 44,000 processed samples, and train a BERT-based classifier that achieves an F1 score of 75%. We introduce a semi-manual ontology development approach that combines the structured Resource Description Framework (RDF) with targeted LLM prompting for focused graph node expansion. The process clusters the extracted nodes to reduce complexity and enable topic-specific investigation. With procurement expert validation, we generated a dataset of 140 question-answer pairs covering key ontology nodes, while the remaining 460 samples were generated automatically using an ontology prompt. Our ontology achieves a Weighted Composite Score (WCS) of 76.42%, indicating high topic coverage across the procurement domain graph. Fine-tuning experiments on the Llama-3.2-1B and Llama-3.2-3B models demonstrate improvements, validated through blind A/B testing with the DeepEval framework: the fine-tuned Llama-3.2-1B model was preferred over the base model in 78.15% of comparisons for answer relevancy, 77.87% for faithfulness, and 77.95% for factual consistency rate (FCR). The fine-tuned Llama-3.2-3B model showed more moderate gains, winning 68.35% of comparisons for answer relevancy, 72.29% for faithfulness, and 72.36% for FCR.
Shevchuk et al. studied this question.
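The keyword-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. The keyword list, the `min_hits` threshold, and the function names below are hypothetical and chosen only for illustration; the abstract does not specify the actual keywords or matching rules used by the authors.

```python
import re
from typing import Iterable, Iterator

# Hypothetical keyword list (an assumption): the actual keywords used for
# filtering are not given in the abstract.
PROCUREMENT_KEYWORDS = ("procurement", "tender", "supplier", "purchase order", "rfp")

# One case-insensitive pattern that matches any keyword as a whole word or phrase.
# Note that exact word boundaries mean plural forms such as "suppliers" do not match.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in PROCUREMENT_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_procurement_related(text: str, min_hits: int = 1) -> bool:
    """Return True if the text contains at least `min_hits` keyword matches."""
    return len(_PATTERN.findall(text)) >= min_hits


def filter_documents(docs: Iterable[str], min_hits: int = 1) -> Iterator[str]:
    """Yield only the documents that pass the keyword filter."""
    for doc in docs:
        if is_procurement_related(doc, min_hits):
            yield doc


if __name__ == "__main__":
    sample_docs = [
        "The supplier responded to the RFP before the deadline.",
        "A simple recipe for tomato soup.",
        "Public procurement rules require an open tender.",
    ]
    kept = list(filter_documents(sample_docs))
    print(f"kept {len(kept)} of {len(sample_docs)} documents")
```

A cheap filter of this kind serves only as a first pass; in the pipeline described above, the surviving documents are then annotated by an LLM and used to train a classifier, which can correct for the false positives that plain keyword matching lets through.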