What question did this study set out to answer?

This research aims to reduce hallucinations in large language models by categorizing training data and incorporating epistemic metadata.

April 8, 2026 · Open Access

VKB-Training: Epistemically and Ontologically Categorized Training Data for Hallucination Reduction in Large Language Models

Key Points

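Proposed VKB-Training, which tags each training sample with one of six epistemic categories (Fact, Model, Value, Hypothesis, BlindSpot, Ontology).
Introduced a four-stage hybrid annotation pipeline: AI triangulation, human sampling with axiom extraction, expert calibration, and logical consistency filters.
Proposed six training mechanisms, including confidence-weighted loss and ontology attention, intended to improve model accuracy.
Outlined a structured, metadata-rich approach to training data aimed at reducing hallucinations; the reported numbers are illustrative and the work is in progress.
Described mechanisms for handling epistemic uncertainty, such as a BlindSpot objective that maximizes output entropy at known knowledge gaps and confidence propagation through knowledge dependencies.
Argued that epistemic and ontological tagging could improve knowledge representation in language models, pending the falsifiable experimental protocol the paper specifies.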

Abstract

Large language models hallucinate because their training data carries no epistemic metadata: facts, hypotheses, value judgments, and acknowledged unknowns occupy the same embedding space with identical weight. A deeper problem compounds this: every claim presupposes an ontology (an axiomatic framework equipped with a metric), and, as Bertrand's paradox demonstrates, probability itself is ill-defined without specifying the measure. We propose VKB-Training (Verified Knowledge Base Training), a data-centric approach that assigns each training sample a six-category epistemic tag (Fact, Model, Value, Hypothesis, BlindSpot, Ontology), a calibrated confidence score, a provenance chain, and an ontology identifier specifying the axiomatic framework under which the claim is asserted. We introduce a four-stage hybrid annotation pipeline: (1) AI triangulation: multiple LLMs classify independently, and inter-model disagreement signals normative content (the "Caesar/God boundary"); (2) human sampling with axiom extraction: domain annotators resolve high-disagreement cases, and recurrent decision principles are extracted as reusable rules; (3) expert calibration with reputation weighting: a formalization of Galton's ox-weighing insight (per S.V.E. XI, DOI: 10.5281/zenodo.18109198); (4) logical consistency filters: contradiction detection and symmetry verification via the CGS Method (DOI: 10.5281/zenodo.18776172).
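As a rough illustration of the tagged sample and the first pipeline stage, the sketch below models one training record and scores inter-model disagreement. The six tag names come from the abstract; everything else is an assumption made for illustration, including the `TaggedSample` field names, the use of normalized Shannon entropy as the disagreement measure, and the 0.3 review threshold. The paper may define these differently.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
import math


class EpistemicTag(Enum):
    """The six categories named in the abstract."""
    FACT = "Fact"
    MODEL = "Model"
    VALUE = "Value"
    HYPOTHESIS = "Hypothesis"
    BLINDSPOT = "BlindSpot"
    ONTOLOGY = "Ontology"


@dataclass
class TaggedSample:
    """One training sample with epistemic metadata (field names are hypothetical)."""
    text: str
    tag: EpistemicTag
    confidence: float                                     # calibrated score in [0, 1]
    provenance: list[str] = field(default_factory=list)   # chain of sources
    ontology_id: str = "default"                          # hypothetical identifier format


def vote_disagreement(votes: list[EpistemicTag]) -> float:
    """Normalized Shannon entropy of the classifier votes.

    0.0 means all models agree; 1.0 means votes are spread evenly
    over all six tags. This measure is an assumption, not the paper's.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    total = len(votes)
    counts = Counter(votes)
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(len(EpistemicTag))


def needs_human_review(votes: list[EpistemicTag], threshold: float = 0.3) -> bool:
    """Route a sample to stage 2 when disagreement exceeds an assumed threshold."""
    return vote_disagreement(votes) > threshold
```

With three classifiers, a unanimous vote scores 0.0 and is not escalated, while a two-to-one split between Fact and Value exceeds the assumed threshold and would be sent to human annotators.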
Six training mechanisms are proposed: confidence-weighted loss; provenance-aware attention; a BlindSpot training objective that maximizes output entropy at known knowledge gaps; confidence propagation through DAG-structured knowledge dependencies (conservative weakest-link and probabilistic strategies); temporal embeddings for version-aware knowledge representation; and ontology attention, a mechanism enabling the model to switch between axiomatic frameworks, with an entropy-based selection cost that balances explanatory parsimony (commit to one frame) against epistemic humility (maintain multiple frames) depending on context. Ontologies are formally defined as triples (axioms, metric, measure) and parameterized along five dimensions (core axioms, metric structure, evidential standards, scope, temporal orientation), drawing on the SES parameterization from S.V.E. XII (included as supplementary material). The paper additionally describes multi-observer Bayesian calibration for cross-cultural epistemic consistency (observer-conditional embeddings with orthonormal cultural transformations) and a computable δ-dehumanization safety metric for detecting ethical drift in LLM outputs, derived from the broader CogOS framework (DOI: 10.5281/zenodo.18109244). VKB-Training was first described as part of the CogOS framework. This paper extracts and formalizes the VKB component as a standalone, empirically testable proposal with a falsifiable experimental protocol and pre-specified success thresholds. Section 7 (Ethical Data Sourcing: Author Revenue Sharing, 10–50%) is included in the preprint but will be omitted from the workshop submission.

NOTE: ILLUSTRATIVE NUMBERS, WORK IN PROGRESS. Prepared for submission to NeurIPS 2025 Workshop.
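To make the two confidence-propagation strategies concrete, here is a minimal sketch under stated assumptions. The abstract names a conservative weakest-link strategy and a probabilistic one but does not give formulas, so the readings used here (minimum over a claim and its prerequisites, and a product that treats prerequisites as independent) are assumptions, as are the function name `propagate_confidence` and its dictionary inputs.

```python
from graphlib import TopologicalSorter
import math


def propagate_confidence(own, deps, strategy="weakest_link"):
    """Compute an effective confidence for every claim in a dependency DAG.

    own:  maps each claim to its own calibrated confidence in [0, 1].
    deps: maps a claim to the list of claims it depends on (may omit roots).
    Every dependency must also appear as a key in `own`.
    """
    # Visit prerequisites before the claims that depend on them.
    graph = {node: deps.get(node, []) for node in own}
    order = TopologicalSorter(graph).static_order()

    effective = {}
    for node in order:
        parents = [effective[p] for p in deps.get(node, [])]
        if strategy == "weakest_link":
            # Conservative: a claim is no more certain than its weakest support.
            effective[node] = min([own[node], *parents])
        elif strategy == "probabilistic":
            # Assumes prerequisites are independent; shared ancestors are
            # counted more than once, so this can understate confidence.
            effective[node] = own[node] * math.prod(parents)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return effective


# Hypothetical example: claim "c" depends on "a" and "b"; "b" depends on "a".
own = {"a": 0.9, "b": 0.8, "c": 0.95}
deps = {"b": ["a"], "c": ["a", "b"]}
weakest = propagate_confidence(own, deps, "weakest_link")
product = propagate_confidence(own, deps, "probabilistic")
```

In this example the weakest-link strategy caps `c` at the confidence of `b`, its least certain prerequisite, while the probabilistic strategy gives a lower value because each prerequisite's uncertainty compounds.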
