What question did this study set out to answer?

The aim is to provide a comprehensive framework for tracing data provenance in machine learning pipelines, linking data from source to prediction.

May 7, 2026 | Open Access

Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning

Key Points

Developed an open-source Python library, Traceprop, for end-to-end data provenance tracking.
Integrated a computation-level lineage layer with gradient-based attribution methods.
Compared against TRAK on CIFAR-2/ResNet-9: Traceprop-LL reached LDS 0.0168 vs. TRAK’s 0.0290 at 266× lower wall-clock cost (2.6 s CPU vs. 691 s GPU).
Achieved sub-1% lineage overhead at 10^6+ array elements in production op-mode (1.007× on macOS, 0.979× on Linux).
Achieved a linear datamodeling score (LDS) of 0.622 ± 0.180 on tabular data (UCI Adult Income, logistic regression) in 0.22 s on CPU.
Showed provenance-guided approximate unlearning reaching a forget-set loss of 0.425, exceeding the retrain-from-scratch gold standard (0.401), with a test accuracy drop of only 0.5 percentage points (0.915 vs. 0.920).

Abstract

Traceprop is an open-source Python library providing the first unified system for end-to-end data provenance in machine learning pipelines, connecting raw source files through preprocessing and model training to individual predictions. Existing data attribution methods (Koh and Liang, 2017; Park et al., 2023; Engstrom et al., 2024) identify which training samples influenced a prediction but operate in isolation from the data pipeline. Existing computation lineage tools (MLflow, DVC, TensorFlow MLMD) track artifact-level provenance but do not descend into the computation graph or connect to gradient-level attribution. Traceprop fills this gap by introducing a computation-level lineage layer that integrates natively with gradient-based attribution. A single Traceprop query answers: “This model made prediction X: which rows in which source files, through which preprocessing steps, most influenced that prediction, and can we reduce that influence without retraining?” We demonstrate: (1) sub-1% lineage overhead in production op-mode at 10^6+ array elements (1.007× on macOS, 0.979× on Linux); (2) Traceprop-LL achieving LDS 0.622 ± 0.180 on tabular data (UCI Adult Income, logistic regression) at 0.22 s on CPU, and LDS 0.0168 on CIFAR-2/ResNet-9 vs. TRAK’s 0.0290 at 266× lower wall-clock cost (2.6 s CPU vs. 691 s GPU); (3) provenance-guided approximate unlearning exceeding the retrain-from-scratch gold standard (forget-set loss 0.425 vs. gold 0.401, vs. 14% gap closed for random unlearning) with a test accuracy drop of only 0.5 percentage points (0.915 vs. 0.920). Traceprop directly addresses EU AI Act Article 26 audit trail obligations for high-risk AI systems, whose backstop enforcement date is 2 December 2027. The library is available at https://pypi.org/project/traceprop/
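
For readers unfamiliar with the LDS metric quoted in the abstract: in the TRAK line of work (Park et al., 2023) the linear datamodeling score is the Spearman rank correlation between (a) the model output predicted by summing attribution scores over a random training subset and (b) the output actually measured after retraining on that subset. The following is a minimal sketch of that computation, not Traceprop's API. All function and variable names are hypothetical, ties in the ranking are not averaged, and the "retrained model outputs" are a synthetic additive stand-in so the example runs without training anything.

```python
import numpy as np


def rank(values):
    """Return 0-based ranks of a 1-D array (ties broken by position, not averaged)."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])


def lds(attributions, subset_masks, subset_outputs):
    """Linear datamodeling score for a single test example.

    attributions:   (n_train,) attribution score per training sample
    subset_masks:   (n_subsets, n_train) boolean, True where a sample is in the subset
    subset_outputs: (n_subsets,) model output measured after retraining on each subset
    """
    # Attribution-based prediction: sum the scores of the samples in each subset.
    predicted = subset_masks.astype(float) @ attributions
    return spearman(predicted, subset_outputs)


# Synthetic demo (assumption: outputs are exactly additive in a hidden influence vector).
rng = np.random.default_rng(0)
n_train, n_subsets = 200, 50
true_influence = rng.normal(size=n_train)
masks = rng.random((n_subsets, n_train)) < 0.5
outputs = masks.astype(float) @ true_influence  # stand-in for retrained-model outputs

perfect = lds(true_influence, masks, outputs)
shuffled = lds(rng.permutation(true_influence), masks, outputs)
print(f"LDS with exact attributions:    {perfect:.3f}")
print(f"LDS with shuffled attributions: {shuffled:.3f}")
```

In a real evaluation the `outputs` vector would come from models retrained on each subset, and the score is typically averaged over many test examples. The synthetic setup here only illustrates why exact attributions score near the maximum while uninformative ones do not.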
