Schema Mapping Generation in the Wild: A Demonstration with Open Government Data

Lacramioara Mazilu, University of Manchester, United Kingdom, lara.mazilu@manchester.ac.uk
Nikolaos Konstantinou, University of Manchester, United Kingdom, nikolaos.konstantinou@manchester.ac.uk
Norman W. Paton, University of Manchester, United Kingdom, norman.paton@manchester.ac.uk
Alvaro A.A. Fernandes, University of Manchester, United Kingdom, alvaro.a.fernandes@manchester.ac.uk

ABSTRACT
Schema mapping generation identifies how data sets can be combined to create views that are relevant to an application. Where the data sets to be combined lack declared relationships, such as foreign keys, schema mapping generation can be considered to be in the wild. In this paper, we describe an approach to schema mapping generation in the context of open government data, in particular, the London Datastore. Mapping generation is informed by inferred profiling data about the data sets and their relationships, where the data sets are made available as CSV files. We outline the mapping generation algorithm, and describe a demonstration of the approach, in which the user can: (i) specify the target to be populated by the generated mappings over a collection of sources from The London Datastore; (ii) browse the generated candidate mappings and the evidence that informed their creation; and (iii) steer the mapping generation process, to make use of preferred sources and dependable profiling results.

1 INTRODUCTION
Given a collection of source datasets, some metadata about them, and a target schema, schema mapping generation produces a collection of views that provide ways of populating the target from the sources. Mapping generation is important because the data of relevance to an application or an analysis is often not immediately available in a single, suitable, integrated form.
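As a rough illustration of the idea above, the following Python sketch profiles two small tables for a candidate key and a partial inclusion dependency, and then uses that evidence to build one candidate view over a target schema. It is not the mapping generation algorithm described in this paper. The table contents, the column names, and the helper functions is_candidate_key, inclusion_overlap and join_view are all hypothetical, and the numbers are invented toy values rather than London Datastore data.

```python
from typing import Dict, List

Row = Dict[str, str]


def is_candidate_key(rows: List[Row], column: str) -> bool:
    """True if every value in `column` is non-empty and unique."""
    values = [row[column] for row in rows]
    return all(values) and len(set(values)) == len(values)


def inclusion_overlap(lhs: List[Row], lhs_col: str,
                      rhs: List[Row], rhs_col: str) -> float:
    """Fraction of distinct lhs values that also occur in rhs.

    1.0 corresponds to a full inclusion dependency; a value strictly
    between 0 and 1 corresponds to a partial one.
    """
    lhs_values = {row[lhs_col] for row in lhs}
    rhs_values = {row[rhs_col] for row in rhs}
    if not lhs_values:
        return 0.0
    return len(lhs_values & rhs_values) / len(lhs_values)


def join_view(left: List[Row], left_col: str,
              right: List[Row], right_col: str,
              target_columns: List[str]) -> List[Row]:
    """Inner join of left and right, projected onto the target columns."""
    # Assumes right_col is a candidate key, so each value maps to one row.
    index = {row[right_col]: row for row in right}
    view = []
    for row in left:
        match = index.get(row[left_col])
        if match is not None:
            merged = {**match, **row}
            view.append({col: merged[col] for col in target_columns})
    return view


# Invented toy sources standing in for two CSV files with no declared keys.
population = [
    {"borough": "Camden", "residents": "10"},
    {"borough": "Hackney", "residents": "20"},
    {"borough": "Islington", "residents": "30"},
]
income = [
    {"area": "Camden", "income": "1"},
    {"area": "Hackney", "income": "2"},
    {"area": "Unknown", "income": "3"},
]

# Profiling evidence: a candidate key and a partial inclusion dependency.
key_found = is_candidate_key(population, "borough")
overlap = inclusion_overlap(income, "area", population, "borough")

# One candidate mapping: a view that populates a hypothetical target schema.
target = ["borough", "residents", "income"]
mapping = join_view(income, "area", population, "borough", target)
```

Here the join is justified only by inferred evidence (a unique column and an overlap of two thirds between the two value sets), which is why a user may want to inspect that evidence before trusting the resulting view.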
© 2020 Copyright held by the owner/author(s). Published in Proceedings of the 23rd International Conference on Extending Database Technology (EDBT), March 30-April 2, 2020, ISBN 978-3-89318-083-7 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Most work on mapping generation has assumed that the source and the target benefit from declared constraints, for example in the form of primary and foreign keys (e.g., as in the seminal work on Clio and its descendants, as reviewed in [5]). However, with the growing availability of open data sets, and the emergence of data lakes, mapping generation over independently produced data sets, with minimal explicit metadata, is arguably even more necessary than for well-defined schemas.

We refer to mapping generation over data sets without declared relationships as in the wild. Mapping generation must, among other things, take into account relationships between data sources, and, in this paper, we assume that candidate keys and (partial) inclusion dependencies have been inferred through data profiling [1]. Then, to deploy schema mapping generation in the wild, the following are required:

(1) A way of exploring the space of candidate mappings. We use a dynamic programming algorithm, referred to as Dynamap [8], to identify promising mappings.
(2) A way of displaying these mappings to the user, together with their properties and the evidence on which they build. As (1) builds on necessarily speculative profiling data, users must be able to review the results of mapping generation, for example to ensure that joins build on inclusion dependencies that represent valid real-world relationships.
(3) A way to enable the user to steer the mapping generation process.
As (2) may identify issues with generated mappings, users must be able to steer the mapping generation process away from unsuitable decisions, for example by ruling out the use of certain inclusion dependencies.

To show (1) to (3) in practice, we demonstrate our mapping generation algorithm, and its associated user interface, in use with data from The London Datastore¹, which offers hundreds of data sets providing diverse information about London.

The remainder of the paper is structured as follows. Section 2 provides some details on The London Datastore. Our mapping generation approach is reviewed in Section 3. The demonstration in Section 4 shows an example of viewing a generated mapping and understanding it through its properties and the evidence on which it was based. The user can steer mapping generation based on the presented information. Section 5 concludes.

2 OPEN DATA CASE STUDY: THE LONDON DATASTORE
Open government data is published in a collection of national, regional, city or topic-based portals, with a view to increasing transparency and informing decision making [3]. The London Datastore is a representative example of a city data repository, providing data sets across a range of topic areas, including transport, employment, housing, health and education. These data sets come from a variety of publishers, including local and national government departments, and many of the data sets use consistent, generous licenses. The London Datastore supports both search and browse interfaces, and allows data sets to be downloaded in a variety of formats. The demonstration uses data sets in comma-separated value (CSV) files, released under the UK Open Government License².
Typically, files contain from a few tens of rows (e.g., there are numerous data sets that have one row for each London Borough, of which there are 33), to a few thousand rows (e.g., there are around 5000 rows in a data set of modelled household income estimates at a particular, rather fine, area granularity). There may be few (e.g., 2) to many columns in each table (e.g., there are hundreds of columns in a ward atlas table, describing different properties of an electoral ward).

¹ https://data.london.gov.uk
² http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

Demonstration Series ISSN: 2367-2005, DOI: 10.5441/002/edbt.2020.77