What question did this study set out to answer?

The research aims to clarify the conceptualization of synthetic data and their utility across disciplines.

June 1, 2026 | Open Access

On the Conceptualization of Synthetic Data: Why Not All Synthetic Data are Useful and Some are

Key points

Examined several definitions and frameworks of synthetic data.
Proposed a refined concept of synthesized data.
Developed an Iceberg model to illustrate hidden risks related to data utilization.
Argued that there is no meaningful conceptual difference between synthetic data and data.
Identified submerged risks such as bias, privacy, and validity in synthetic data.
Enhanced understanding of the relational view linking data purpose with generation methods.

Abstract

Synthetic data are increasingly used across the sciences and industry and are widely regarded as a solution to data problems such as privacy, bias mitigation, and data insufficiency. At the same time, the “promises” of synthetic data have faced substantial scholarly critique. We argue that much of the disagreement surrounding the usefulness of synthetic data stems from a lack of clarity about what synthetic data actually are and what they can do. In this paper, we examine several conceptualizations of synthetic data and propose a refined concept of synthesized data that more accurately captures the phenomenon commonly referred to as synthetic data. We argue that synthetic data cannot be conceptualized in contrast with other types of data. Moreover, there is no meaningful difference between synthetic data and data. We further prove this argument through an Iceberg model of the process of data synthetization and utilization, which addresses the “submerged” data risks such as bias, privacy, and validity, thus ensuring the usefulness of synthesized data. Our approach is grounded in a relational view of data, which links their intended purpose with their method of generation. We further elaborate this connection through concrete examples. In conclusion, we demonstrate the complexity of synthesized data in relation to their utility, which often remains hidden in current discourse on synthetic data.
