Large language models hallucinate because their training data carries no epistemic metadata: facts, hypotheses, value judgments, and acknowledged unknowns occupy the same embedding space with identical weight. A deeper problem compounds this: every claim presupposes an ontology (an axiomatic framework equipped with a metric), and, as Bertrand's paradox demonstrates, probability itself is ill-defined without specifying the measure. We propose VKB-Training (Verified Knowledge Base Training), a data-centric approach that assigns each training sample a six-category epistemic tag (Fact, Model, Value, Hypothesis, BlindSpot, Ontology), a calibrated confidence score, a provenance chain, and an ontology identifier specifying the axiomatic framework under which the claim is asserted. We introduce a four-stage hybrid annotation pipeline: (1) AI triangulation: multiple LLMs classify each sample independently, and inter-model disagreement signals normative content (the "Caesar/God boundary"); (2) human sampling with axiom extraction: domain annotators resolve high-disagreement cases, and recurrent decision principles are extracted as reusable rules; (3) expert calibration with reputation weighting, a formalization of Galton's ox-weighing insight (per S.V.E. XI, DOI: 10.5281/zenodo.18109198); (4) logical consistency filters: contradiction detection and symmetry verification via the CGS Method (DOI: 10.5281/zenodo.18776172).
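To make the per-sample metadata and the first pipeline stage concrete, the following is a minimal sketch, not the paper's implementation. The six tag names come from the abstract; the `TaggedSample` field layout, the use of normalized Shannon entropy as the disagreement measure, and the 0.3 review threshold are all illustrative assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class EpistemicTag(Enum):
    """The six epistemic categories named in the abstract."""
    FACT = "Fact"
    MODEL = "Model"
    VALUE = "Value"
    HYPOTHESIS = "Hypothesis"
    BLIND_SPOT = "BlindSpot"
    ONTOLOGY = "Ontology"


@dataclass
class TaggedSample:
    """Hypothetical record layout for one annotated training sample."""
    text: str
    tag: EpistemicTag
    confidence: float          # calibrated confidence in [0, 1]
    ontology_id: str           # framework under which the claim is asserted
    provenance: list[str] = field(default_factory=list)  # source chain


def disagreement(votes: list[EpistemicTag]) -> float:
    """Shannon entropy of the vote distribution, normalized to [0, 1].

    0.0 means every model chose the same tag; 1.0 would mean the votes
    are spread uniformly over all six tags. (Assumed measure.)
    """
    if not votes:
        raise ValueError("need at least one vote")
    n = len(votes)
    counts = Counter(votes)
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(len(EpistemicTag))


def needs_human_review(votes: list[EpistemicTag], threshold: float = 0.3) -> bool:
    """Route a sample to stage 2 when models disagree (threshold is assumed)."""
    return disagreement(votes) > threshold


unanimous = [EpistemicTag.FACT] * 3
split = [EpistemicTag.FACT, EpistemicTag.FACT, EpistemicTag.VALUE]
print(disagreement(unanimous), needs_human_review(unanimous))
print(round(disagreement(split), 3), needs_human_review(split))
```

Under these assumptions a unanimous vote scores zero and stays in the automatic path, while a two-to-one split between Fact and Value exceeds the threshold and would be passed to human annotators.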
Six training mechanisms are proposed: confidence-weighted loss; provenance-aware attention; a BlindSpot training objective that maximizes output entropy at known knowledge gaps; confidence propagation through DAG-structured knowledge dependencies (conservative weakest-link and probabilistic strategies); temporal embeddings for version-aware knowledge representation; and ontology attention, a mechanism that lets the model switch between axiomatic frameworks, with an entropy-based selection cost that balances explanatory parsimony (commit to one frame) against epistemic humility (maintain multiple frames) depending on context. Ontologies are formally defined as triples (axioms, metric, measure) and parameterized along five dimensions (core axioms, metric structure, evidential standards, scope, temporal orientation), drawing on the SES parameterization from S.V.E. XII (included as supplementary material). The paper additionally describes multi-observer Bayesian calibration for cross-cultural epistemic consistency (observer-conditional embeddings with orthonormal cultural transformations) and a computable δ-dehumanization safety metric for detecting ethical drift in LLM outputs, derived from the broader CogOS framework (DOI: 10.5281/zenodo.18109244). VKB-Training was first described as part of the CogOS framework. This paper extracts and formalizes the VKB component as a standalone, empirically testable proposal with a falsifiable experimental protocol and pre-specified success thresholds. Section 7 (Ethical Data Sourcing: Author Revenue Sharing, 10–50%) is included in the preprint but will be omitted from the workshop submission. NOTE: ILLUSTRATIVE NUMBERS; WIP. Prepared for submission to NeurIPS 2025 Workshop.
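The two confidence propagation strategies named above can be sketched as follows. This is one plausible reading, not the paper's definition: the weakest-link rule is taken to be the minimum over a claim's own confidence and that of everything it depends on, and the probabilistic rule is taken to be a naive product that assumes dependencies are independent. The function name `propagate` and the example claims are hypothetical.

```python
import math
from graphlib import TopologicalSorter


def propagate(confidence: dict[str, float],
              depends_on: dict[str, list[str]],
              strategy: str = "weakest_link") -> dict[str, float]:
    """Push confidence through a DAG of knowledge dependencies.

    confidence: each claim's own (local) confidence in [0, 1].
    depends_on: claim -> claims it directly relies on.
    Returns the effective confidence of every claim.
    """
    # Include claims that have no dependencies so they are visited too.
    graph = {node: depends_on.get(node, ()) for node in confidence}
    effective: dict[str, float] = {}
    # static_order() yields each claim after all of its dependencies.
    for node in TopologicalSorter(graph).static_order():
        parents = [effective[p] for p in graph[node]]
        own = confidence[node]
        if strategy == "weakest_link":
            # Conservative: a claim is no more certain than its weakest support.
            effective[node] = min([own, *parents])
        elif strategy == "probabilistic":
            # Assumed independence: multiply own confidence by each parent's.
            effective[node] = own * math.prod(parents)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return effective


confidence = {"axiom": 0.9, "lemma": 0.8, "claim": 0.95}
depends_on = {"lemma": ["axiom"], "claim": ["lemma"]}
print(propagate(confidence, depends_on, "weakest_link"))
print(propagate(confidence, depends_on, "probabilistic"))
```

In this toy chain the weakest-link rule caps the final claim at 0.8, the confidence of its least certain ancestor, whereas the product rule lowers it further with each dependency. The product form double-counts ancestors that are reachable along several paths, so it should be read as a rough lower-bound heuristic rather than a proper joint probability.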
Artiom Kovnatsky
Laboratoire Spécification et Vérification
Artiom Kovnatsky (Mon,) studied this question.
www.synapsesocial.com/papers/69d5f05d74eaea4b11a79cc2 — DOI: https://doi.org/10.5281/zenodo.19436724