May 9, 2026 · Open Access

On Having Data

Key points

The talk discusses data management principles from computer science that can improve the creation and maintenance of political datasets.
Delivered at a workshop on political representation in Europe.
Develops six principles for effective data management and integration.
Emphasizes the importance of metadata, provenance, and adaptable data structures.
Argues that metadata should be treated as first-class data that can be queried.
Highlights the need to understand data provenance when unifying datasets.
Points out that current relevance-ranking systems in data mining are epistemologically deficient.

Abstract

Talk delivered at the workshop "Political Representation in Europe: Methods and Data," organized by Christine Arnold and Mark Franklin, Netherlands Institute for Advanced Study in the Humanities and Social Sciences (NIAS-KNAW), Wassenaar, The Netherlands, October 25-26, 2012. Financial support from NIAS, KNAW, and FASoS, Maastricht University.

The workshop brought together political scientists building long-running comparative political datasets; the author was invited as a data management expert. The talk addresses what the database discipline in computer science has to offer such projects, and where common practice falls short. Six principles are developed.

Metadata must be first-class data, amenable to querying and reasoning, not comments appended to records: Delenda Annotatio (comments must become metadata). Provenance is essential: what time period are the data about, when were they published or ingested, who altered or decorated them and when? These questions become critical when multiple datasets are harmonized across time and source. Hierarchies should not be trusted as stable; data is messy, and design should iterate between top-down codebooks and bottom-up inferred structure. Data rots: only the high-level meaning of a dataset may remain stable over decades; formats, platforms, institutions, and providers will change, and design must account for this. Human intervention must remain possible even where it is not planned; systems that preclude correction eventually fail in ways that cannot be corrected.

The sixth principle is the most epistemologically pointed: the relevance of a web search is tautological. A relevance-ranking system returns what looks relevant; it cannot be interrogated about whether its output is correct. The talk identifies this as an epistemological deficit in the practice of mining large text corpora for political salience. The observation applies as well to large language models, which are optimized for fluency and coherence rather than for correctness. This is an early documented statement of that concern, addressed to a non-CS audience at a moment when generative AI did not yet exist.
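The first two principles, metadata as first-class data and explicit provenance, can be illustrated with a minimal sketch. It is not taken from the talk: the table names, column names, and rows are all hypothetical, and SQLite is used only because it ships with Python. The point is that provenance kept in its own table can be queried, which a free-text comment column cannot.

```python
import sqlite3

# Every table name, column name, and row here is a hypothetical illustration.
SCHEMA = """
CREATE TABLE observation (
    id         INTEGER PRIMARY KEY,
    country    TEXT NOT NULL,
    party      TEXT NOT NULL,
    vote_share REAL NOT NULL
);
-- Provenance is stored as structured rows, not as a free-text comment.
CREATE TABLE provenance (
    observation_id INTEGER NOT NULL REFERENCES observation(id),
    source         TEXT NOT NULL,  -- dataset that supplied the value
    period_start   TEXT NOT NULL,  -- period the value is about (ISO dates)
    period_end     TEXT NOT NULL,
    ingested_on    TEXT NOT NULL,  -- when the value entered this database
    altered_by     TEXT,           -- who changed it after ingestion, if anyone
    altered_on     TEXT
);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database holding three made-up observations."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO observation VALUES (?, ?, ?, ?)",
        [(1, "NL", "Party A", 26.6),
         (2, "NL", "Party B", 24.8),
         (3, "DE", "Party C", 33.8)],
    )
    con.executemany(
        "INSERT INTO provenance VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(1, "survey_x", "2012-01-01", "2012-12-31", "2013-02-01", None, None),
         (2, "survey_y", "2012-01-01", "2012-12-31", "2013-03-15",
          "coder_1", "2014-06-30"),
         (3, "survey_x", "2009-01-01", "2009-12-31", "2013-02-01",
          "coder_2", "2013-05-10")],
    )
    return con


def altered_observations(con: sqlite3.Connection) -> list:
    """Return (id, country, altered_by, altered_on) for values changed after ingestion."""
    rows = con.execute(
        """
        SELECT o.id, o.country, p.altered_by, p.altered_on
        FROM observation AS o
        JOIN provenance AS p ON p.observation_id = o.id
        WHERE p.altered_by IS NOT NULL
        ORDER BY o.id
        """
    )
    return rows.fetchall()


def observations_about(con: sqlite3.Connection, year: int) -> list:
    """Return ids of observations whose covered period overlaps the given year."""
    rows = con.execute(
        """
        SELECT o.id
        FROM observation AS o
        JOIN provenance AS p ON p.observation_id = o.id
        WHERE p.period_start <= ? AND p.period_end >= ?
        ORDER BY o.id
        """,
        (f"{year}-12-31", f"{year}-01-01"),
    )
    return [row[0] for row in rows]
```

With this layout, questions such as "which values were altered after ingestion, and by whom?" or "which observations are about 2012?" become ordinary queries, for example `altered_observations(build_demo())` and `observations_about(build_demo(), 2012)`. Separating the period the data are about from the ingestion date is a design choice made for this sketch; it mirrors the distinction the abstract draws between the two.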
