Current approaches to verifying AI training data compliance face a fundamental tension: copyright holders need to know whether their content was used in training (EU AI Act, Article 53(1)(d)), while model providers need to protect their training data as trade secrets (GDPR, trade secret law). Existing zero-knowledge proof systems for machine learning (ZKML) address this partially by providing proofs of non-membership for exact data points. However, real-world training pipelines involve tokenization, chunking, paraphrasing, and augmentation, rendering exact-match proofs insufficient. We identify a gap in the literature: no existing system combines semantic fingerprinting with zero-knowledge proofs to enable semantic non-membership verification. We propose an architecture for Zero-Knowledge Semantic Non-Membership (ZK-SNM) that enables a model provider to prove, without revealing any training data, that no document in their training corpus is semantically similar to a queried document above a specified threshold. We discuss the technical challenges, including the computational cost of similarity search within ZK circuits, and propose mitigation strategies based on locality-sensitive hashing and hierarchical verification. This position paper establishes the problem formulation and proposed architecture; experimental validation is left to subsequent work.
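As a minimal illustration of the statement a ZK-SNM proof would attest to, the following Python sketch evaluates the semantic non-membership predicate in the clear, with no zero-knowledge component. It assumes documents are already mapped to embedding vectors and that similarity is cosine similarity. It uses random-hyperplane hashing as one possible locality-sensitive hashing scheme. The function names, the toy vectors, and the threshold `tau` are hypothetical and are not taken from the proposed architecture.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_non_member(query, corpus, tau):
    """Exhaustive predicate: True iff no corpus vector has similarity above tau.

    This is the statement a ZK-SNM proof would attest to; here it is
    evaluated in plaintext purely for illustration.
    """
    return all(cosine(query, doc) <= tau for doc in corpus)


def make_hyperplanes(dim, n_bits, seed=0):
    """Random hyperplanes for a sign-based LSH (an assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(a * b for a, b in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def build_index(corpus, planes):
    """Bucket corpus indices by LSH signature."""
    index = {}
    for i, doc in enumerate(corpus):
        index.setdefault(lsh_signature(doc, planes), []).append(i)
    return index


def lsh_candidates(query, index, planes):
    """Corpus indices sharing the query's bucket (approximate candidates only)."""
    return index.get(lsh_signature(query, planes), [])
```

The exhaustive predicate scales linearly with corpus size, which is the cost that motivates bucketing. The LSH lookup narrows the candidate set but can miss similar documents that fall into a different bucket, so a sound non-membership argument would still need to account for those false negatives, for example through multiple hash tables or a hierarchical check.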
Untila Octavian studied this question.