What question did this study set out to answer?

This research aims to understand the impact of dataset composition on the performance of convolutional neural networks.

January 25, 2026Open Access

Gaining Understanding of Neural Networks with Programmatically Generated Data

Key Points

This research aims to understand the impact of dataset composition on the performance of convolutional neural networks.
Developed a framework using synthetic datasets to control visual features
Constructed four digit datasets with distinct object and background features
Trained lightweight CNNs using K-fold validation
Applied statistical evaluation to assess cross-dataset performance
Internal object patterns enhance accuracy and F1 scores compared to non-object backgrounds
Achieved near-perfect correlation (ρ=0.97) in dataset similarity prediction of model performance

Abstract

The performance of convolutional neural networks (CNNs) depends not only on model architecture but also on the structure and quality of the training data. While most artificial network interpretability methods focus on explaining trained models, less attention has been given to understanding how dataset composition itself shapes learning outcomes. This work introduces a novel framework that uses programmatically generated synthetic datasets to isolate and control visual features, enabling systematic evaluation of their contribution to CNN performance. Guided by principles from set theory, Shapley values, and the Apriori algorithm, we formalize an equivalence between CNN kernel weights and pattern frequency counts, showing that feature overlap across datasets predicts model generalization. Methods include the construction of four synthetic digit datasets with controlled object and background features, training lightweight CNNs under K-fold validation, and statistical evaluation of cross-dataset performance. The results show that internal object patterns significantly improve accuracy and F1 scores compared to non-object background features, and that a dataset similarity prediction algorithm achieves near-perfect correlation (ρ=0.97) between the predicted and observed performance. The conclusions highlight that dataset feature composition can be treated as a measurable proxy for model behavior, offering a new path for dataset evaluation, pruning, and design optimization. This approach provides a principled framework for predicting CNN performance without requiring full-scale model training.

Gaining Understanding of Neural Networks with Programmatically Generated Data

Key Points

Abstract

Cite This Study