Schema Mapping Generation in the Wild: A Demonstration
with Open Government Data
Lacramioara Mazilu
University of Manchester, United Kingdom
lara.mazilu@manchester.ac.uk
Nikolaos Konstantinou
University of Manchester, United Kingdom
nikolaos.konstantinou@manchester.ac.uk
Norman W. Paton
University of Manchester, United Kingdom
norman.paton@manchester.ac.uk
Alvaro A.A. Fernandes
University of Manchester, United Kingdom
alvaro.a.fernandes@manchester.ac.uk
ABSTRACT
Schema mapping generation identifies how data sets can be combined
to create views that are relevant to an application. Where
the data sets to be combined lack declared relationships, such as
foreign keys, schema mapping generation can be considered to
be in the wild. In this paper, we describe an approach to schema
mapping generation in the context of open government data,
in particular, The London Datastore. Mapping generation is informed
by inferred profiling data about the data sets and their
relationships, where the data sets are made available as CSV files.
We outline the mapping generation algorithm, and describe a
demonstration of the approach, in which the user can: (i) specify
the target to be populated by the generated mappings over a
collection of sources from The London Datastore; (ii) browse the
generated candidate mappings and the evidence that informed
their creation; and (iii) steer the mapping generation process, to
make use of preferred sources and dependable profiling results.
1 INTRODUCTION
Given a collection of source datasets, some metadata about them,
and a target schema, schema mapping generation produces a
collection of views that provide ways of populating the target
from the sources. Mapping generation is important because the
data of relevance to an application or an analysis is often not
immediately available in a single, suitable, integrated form.
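To make the notion of a mapping concrete, the following minimal sketch treats a mapping as a view that joins two sources and projects the result onto the target attributes. The function name, the toy tables, and their values are hypothetical illustrations, not data from The London Datastore and not the system's implementation.

```python
from typing import Dict, List

Row = Dict[str, str]


def join_and_project(left: List[Row], right: List[Row],
                     left_key: str, right_key: str,
                     target_attrs: Dict[str, str]) -> List[Row]:
    """Equi-join two sources and rename columns onto the target schema.

    target_attrs maps each target attribute to a source column name.
    """
    # Index the right-hand source by its join column.
    index: Dict[str, List[Row]] = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)

    out: List[Row] = []
    for l in left:
        for r in index.get(l[left_key], []):
            merged = {**r, **l}  # left values win on column-name clashes
            out.append({t: merged[s] for t, s in target_attrs.items()})
    return out


# Toy sources (made-up values) and a two-attribute target.
boroughs = [{"code": "B1", "name": "Borough A"},
            {"code": "B2", "name": "Borough B"}]
incomes = [{"area_code": "B1", "median_income": "100"}]

view = join_and_project(boroughs, incomes, "code", "area_code",
                        {"borough": "name", "income": "median_income"})
```

Here the target is populated only for rows that join; a mapping generator would typically consider several such views, each built from different sources and join conditions.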
Most work on mapping generation has assumed that the source
and the target benefit from declared constraints, for example in
the form of primary and foreign keys (e.g., as in the seminal work
on Clio and its descendants, as reviewed in [5]). However, with
the growing availability of open data sets, and the emergence
of data lakes, mapping generation over independently produced
data sets, with minimal explicit metadata, is arguably even more
necessary than for well-defined schemas.
We refer to mapping generation over data sets without de-
clared relationships as in the wild. Mapping generation must,
among other things, take into account relationships between
data sources, and, in this paper, we assume that candidate keys
and (partial) inclusion dependencies have been inferred through
data profiling [1]. Then, to deploy schema mapping generation
in the wild, the following are required:
(1) A way of exploring the space of candidate mappings. We use
a dynamic programming algorithm, referred to as Dynamap [8],
to identify promising mappings.
© 2020 Copyright held by the owner/author(s). Published in Proceedings of the 23rd
International Conference on Extending Database Technology (EDBT), March 30-
April 2, 2020, ISBN 978-3-89318-083-7 on OpenProceedings.org.
Distribution of this paper is permitted under the terms of the Creative Commons
license CC-by-nc-nd 4.0.
(2) A way of displaying these mappings, their properties, and
the evidence on which they build, to the user. As (1) builds
on necessarily speculative profiling data, the results of
mapping generation must be open to review by users,
for example to ensure that joins build on inclusion
dependencies that represent valid real-world relationships.
(3) A way to enable the user to steer the mapping generation
process. As (2) may identify issues with generated map-
pings, users must be able to steer the mapping generation
process away from unsuitable decisions, for example by
ruling out the use of certain inclusion dependencies.
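The kind of profiling evidence assumed above, and the way a user might rule out an inclusion dependency, can be sketched as follows. This is a simplified illustration under stated assumptions: keys are single columns, a partial inclusion dependency is scored by the fraction of distinct values that overlap, and all names (`candidate_keys`, `overlap`, `join_evidence`, the toy tables) are hypothetical. It is not the profiling method of [1] or the Dynamap algorithm.

```python
from typing import Dict, List, Optional, Set, Tuple

Table = Dict[str, List[str]]       # column name -> column values
Ind = Tuple[str, str, str, str]    # (dependent table, column, referenced table, key)


def candidate_keys(table: Table) -> List[str]:
    """Single columns whose values are all non-empty and distinct."""
    return [c for c, vals in table.items()
            if vals and all(vals) and len(set(vals)) == len(vals)]


def overlap(dependent: List[str], referenced: List[str]) -> float:
    """Fraction of distinct dependent values also present in referenced.

    1.0 corresponds to a full inclusion dependency; lower values are partial.
    """
    dep = set(dependent)
    return len(dep & set(referenced)) / len(dep) if dep else 0.0


def join_evidence(tables: Dict[str, Table], min_overlap: float,
                  excluded: Optional[Set[Ind]] = None) -> List[Tuple[Ind, float]]:
    """Inclusion dependencies into candidate keys, minus user exclusions."""
    excluded = excluded or set()
    found: List[Tuple[Ind, float]] = []
    for ref_name, ref in tables.items():
        for key in candidate_keys(ref):
            for dep_name, dep in tables.items():
                if dep_name == ref_name:
                    continue
                for col, vals in dep.items():
                    ind = (dep_name, col, ref_name, key)
                    score = overlap(vals, ref[key])
                    if score >= min_overlap and ind not in excluded:
                        found.append((ind, score))
    return found


# Toy tables with made-up values.
tables = {
    "boroughs": {"code": ["B1", "B2", "B3"], "name": ["North", "South", "East"]},
    "incomes": {"area": ["B1", "B2", "B9", "B1"], "income": ["10", "20", "30", "40"]},
}
evidence = join_evidence(tables, min_overlap=0.5)

# Steering: the user rules out the inferred dependency, so it is not reused.
steered = join_evidence(tables, min_overlap=0.5,
                        excluded={("incomes", "area", "boroughs", "code")})
```

In this sketch, excluding a dependency simply removes it from the evidence available for joins; how the actual system propagates such feedback into mapping generation is not specified here.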
To show (1) to (3) in practice, we demonstrate our mapping
generation algorithm, and its associated user interface, in use
with data from The London Datastore¹, which provides hundreds
of data sets with diverse information about London.
The remainder of the paper is structured as follows. Section 2
provides some details on The London Datastore. Our mapping
generation approach is reviewed in Section 3. The demonstration
in Section 4 shows an example of viewing a generated mapping
and understanding it through its properties and the evidence
on which it was built. The user can then steer mapping generation
based on the presented information. Section 5 concludes.
2 OPEN DATA CASE STUDY: THE LONDON
DATASTORE
Open government data is published in a collection of national,
regional, city or topic-based portals, with a view to increasing
transparency and informing decision making [3]. The London
Datastore is a representative example of a city data repository, providing
data sets across a range of topic areas, including transport,
employment, housing, health and education. These datasets come
from a variety of publishers, including local and national gov-
ernment departments, and many of the data sets use consistent,
generous licenses. The London Datastore supports both search
and browse interfaces, and allows data sets to be downloaded in
a variety of formats.
The demonstration uses comma-separated-value (CSV) file data sets,
released under the UK Open Government Licence². Typically,
files contain from a few tens of rows (e.g., there are numerous
data sets that have one row for each London Borough, of which
there are 33), to a few thousand rows (e.g., there are around 5000
rows in a data set of modelled household income estimates at
a particular, rather fine, area granularity). There may be few
(e.g., 2) to many columns in each table (e.g., there are hundreds
of columns in a ward atlas table, describing different properties
of an electoral ward).
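As a rough illustration of how the row and column counts described above could be obtained before profiling, the sketch below reads CSV text with Python's standard `csv` module. It assumes that the first row is a header, which may not hold for every published file; the function name and sample content are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def table_shape(csv_text: str) -> Tuple[List[str], int]:
    """Return the header and the number of non-empty data rows.

    Assumes the first row of the file is a header row.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    n_rows = sum(1 for row in reader if row)  # skip blank lines
    return header, n_rows


# Made-up sample content standing in for a downloaded file.
sample = "borough,population\nNorth,100\nSouth,200\n"
header, n_rows = table_shape(sample)
```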
¹ https://data.london.gov.uk
² http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
Demonstration. Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2020.77