Abstract
Synthetic data are increasingly used across the sciences and industry and are widely regarded as a solution to data problems such as privacy, bias, and data insufficiency. At the same time, the “promises” of synthetic data have faced substantial scholarly critique. We argue that much of the disagreement about the usefulness of synthetic data stems from a lack of clarity about what synthetic data actually are and what they can do. In this paper, we examine several conceptualizations of synthetic data and propose a refined concept of synthesized data that more accurately captures the phenomenon commonly referred to as synthetic data. We argue that synthetic data cannot be conceptualized in contrast with other types of data, and that there is no meaningful difference between synthetic data and data. We substantiate this argument with an Iceberg model of the process of data synthetization and utilization, which addresses “submerged” data risks concerning bias, privacy, and validity and thereby helps ensure the usefulness of synthesized data. Our approach is grounded in a relational view of data, which links the intended purpose of data with their method of generation, and we elaborate this connection through concrete examples. In conclusion, we demonstrate the complexity of synthesized data in relation to their utility, a complexity that often remains hidden in current discourse on synthetic data.
Pashevich et al. studied this question.