Talk delivered at the workshop "Political Representation in Europe: Methods and Data," organized by Christine Arnold and Mark Franklin, Netherlands Institute for Advanced Study in the Humanities and Social Sciences (NIAS-KNAW), Wassenaar, The Netherlands, October 25-26, 2012. Financial support was provided by NIAS, KNAW, and FASoS, Maastricht University.

The workshop brought together political scientists building long-running comparative political datasets; the author was invited as a data management expert. The talk addresses what the database discipline in computer science has to offer such projects, and where common practice falls short. Six principles are developed. Metadata must be first-class data, amenable to querying and reasoning, not comments appended to records: Delenda Annotatio (comments must become metadata). Provenance is essential: what time period are the data about, when were they published or ingested, and who altered or decorated them and when? These questions become critical when multiple datasets are harmonized across time and source. Hierarchies should not be trusted as stable; data is messy, and design should iterate between top-down codebooks and bottom-up inferred structure. Data rots: only the high-level meaning of a dataset may remain stable over decades; formats, platforms, institutions, and providers will change, and design must account for this. Human intervention must remain possible even where it is not planned; systems that preclude correction eventually fail in ways that cannot be corrected. The sixth principle is the most epistemologically pointed: the relevance of a web search is tautological. A relevance-ranking system returns what looks relevant; it cannot be interrogated about whether its output is correct.
The talk identifies this as an epistemological deficit in the practice of mining large text corpora for political salience, but the observation applies equally to large language models, which are optimized for fluency and coherence rather than for correctness. This is an early documented statement of that concern, addressed to a non-CS audience at a moment when generative AI did not yet exist.
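
As a rough illustration of the first two principles (metadata as first-class data, and explicit provenance), the following is a minimal sketch using Python's standard-library sqlite3 module. It is not taken from the talk: the table names, column names, the `flagged` helper, and all sample rows are invented for illustration. The idea is only that an annotation stored as a row, with its author and timestamp, can be queried together with the period the data are about, whereas a free-text comment cannot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (
    id          INTEGER PRIMARY KEY,
    country     TEXT NOT NULL,
    variable    TEXT NOT NULL,
    value       REAL,
    valid_from  TEXT NOT NULL,  -- period the datum is about (ISO date)
    valid_to    TEXT NOT NULL,
    ingested_at TEXT NOT NULL,  -- when it entered this database
    source      TEXT NOT NULL   -- where it came from
);
CREATE TABLE annotation (       -- metadata as rows, not as comments
    id             INTEGER PRIMARY KEY,
    observation_id INTEGER NOT NULL REFERENCES observation(id),
    kind           TEXT NOT NULL,  -- e.g. 'recoded', 'note'
    body           TEXT NOT NULL,
    author         TEXT NOT NULL,  -- who altered or decorated the datum
    made_at        TEXT NOT NULL   -- and when
);
""")

# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "AA", "turnout", 61.5, "2009-01-01", "2009-12-31", "2012-03-01", "source_x"),
        (2, "BB", "turnout", 48.0, "2004-01-01", "2004-12-31", "2012-03-01", "source_y"),
    ],
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "recoded", "merged two regional codes", "coder_1", "2012-06-15"),
        (2, 2, "recoded", "harmonized party labels", "coder_2", "2012-07-01"),
    ],
)


def flagged(conn, kind, year):
    """Return (country, variable, author, made_at) for observations whose
    valid period overlaps `year` and that carry an annotation of `kind`."""
    return conn.execute(
        """
        SELECT o.country, o.variable, a.author, a.made_at
        FROM observation AS o
        JOIN annotation AS a ON a.observation_id = o.id
        WHERE a.kind = ?
          AND o.valid_from <= ?   -- period starts before the year ends
          AND o.valid_to   >= ?   -- and ends after the year starts
        ORDER BY o.country
        """,
        (kind, f"{year}-12-31", f"{year}-01-01"),
    ).fetchall()


print(flagged(conn, "recoded", 2009))
```

Keeping the period the data describe (`valid_from`, `valid_to`) separate from the time of ingestion and the time of annotation is one common way to keep the provenance questions answerable after datasets are merged; other designs are possible.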
Cristobal Pedregal Martin studied this question.