Publish-Time Data Integration for Open Data Platforms

Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner
Technische Universität Dresden
Faculty of Computer Science, Database Technology Group
01062 Dresden, Germany
firstname.lastname@tu-dresden.de

ABSTRACT

Platforms for publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even just common publication standards. This results in inconsistent names for attributes of the same meaning, which constrains the discovery of relationships between datasets as well as their reusability. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, where the publisher is provided with suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We propose data-driven algorithms that suggest alternative attribute names for a newly published dataset based on attribute and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the Open Data Platform opendata.socrata.com and relational data extracted from Wikipedia. We report on the system's response time, and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated attribute name alternatives.

1. MOTIVATION

Platforms for collaborative collection and reuse of datasets are a current trend on the web. A prime example of platforms of this kind are Open Data Platforms, such as data.gov or data.gov.uk, where government agencies publish datasets of public interest.
But this trend is not limited to governmental efforts: organizations, corporations and citizens have become data publishers and editors too. Another example of collaborative data management is Wikipedia, which contains over a million relational-style tables in its current English version. The free-for-all nature of these data publishing platforms leads to strong heterogeneity in the corpora managed by these platforms, which contradicts their primary goals: data reusability and composability.

This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in: WOD '13, June 03 2013, Paris, France. © 2015 ACM. ISBN 978-1-4503-2020-7/13/06. DOI: http://dx.doi.org/10.1145/2500410.2500413

One specific problem of this heterogeneity is the schema vocabulary, i.e., the terms used as attribute names of the datasets on the platform. Since the data is published by various authors using different terms for the same concepts, this leads to the classic data integration problem of finding equivalences between attributes in different datasets.

While the conventional techniques of schema matching are in principle applicable to all of these problems, they are usually designed to be used at reuse-time, i.e., when it is clear which datasets should be integrated and reused together. These techniques will not help to prevent the increase of heterogeneity in repositories where a growing number of authors publish a rising number of datasets. One solution would be to force newly published datasets to conform to a centralized schema or ontology, in which case schema matching techniques could be applied. While this may be a viable approach for some repositories, enforcing or even just creating an integrated schema will be unfeasible for multi-domain repositories with authors acting totally independently of each other.
Additionally, this would limit the number and diversity of published datasets, since the publishing effort would increase. So if we assume that the repository in question does not have an integrated schema or ontology, data publishers are free to choose arbitrary attribute names, leading to degenerated schemata and thus increasing integration effort at reuse-time.

Publish-Time Data Integration (PTDI). The system presented in this paper is based on the idea that, given the right tool support, some lightweight integration work can easily be done at publish-time to make the new dataset fit into the existing repository. When the user publishes a new dataset, the PTDI system augments the schema with alternative attribute names using statistics it maintains about the attributes and corresponding instances on the platform. To illustrate this process, consider Figure 1, which depicts an exemplary corpus C consisting of four datasets ds1 to ds4 in two different domains. Furthermore, consider the new dataset ds+ that is to be added to the corpus. The system should generate the output {c_name → (Country, STATE)}, as the latter two attribute names are used in the existing corpus for country names. Since "Country" appears twice (first column in ds1 and third column in ds3) and "STATE" only once (first column in ds2), "Country" is ranked before "STATE". For the second attribute name "c_lang" in ds+ there is no recommendation,
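The frequency-based ranking in this example can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's actual algorithm: it finds candidate attribute names by plain instance-value overlap between the new column and corpus columns, and the names suggest_names and min_overlap are hypothetical.

```python
from collections import Counter

def suggest_names(new_values, corpus_columns, min_overlap=0.5):
    """Suggest alternative attribute names for a new column.

    corpus_columns is a list of (attribute_name, set_of_instance_values)
    pairs drawn from the corpus. A corpus column is a candidate if the
    fraction of the new column's values it covers reaches min_overlap;
    candidate names are then ranked by how often they occur.
    """
    new_set = set(new_values)
    candidates = [
        name
        for name, values in corpus_columns
        if len(new_set & values) / len(new_set) >= min_overlap
    ]
    return [name for name, _ in Counter(candidates).most_common()]

# Toy corpus mirroring the ds1..ds4 example from the text.
corpus = [
    ("Country", {"Germany", "France", "Spain"}),    # ds1, first column
    ("STATE",   {"Germany", "Italy", "Austria"}),   # ds2, first column
    ("Country", {"France", "Spain", "Poland"}),     # ds3, third column
    ("Year",    {"2010", "2011", "2012"}),          # ds4, unrelated column
]

# "Country" matches twice, "STATE" once, so "Country" ranks first.
print(suggest_names(["Germany", "France", "Spain", "Italy"], corpus))
# → ['Country', 'STATE']
```

For an attribute such as "c_lang", whose values overlap with no corpus column above the threshold, the list of suggestions is simply empty, matching the behavior described above.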