Automatic Tag Recommendation for the UN Humanitarian Data Exchange
Ghadeer Abuodaᵃ, Chad Hendrixᵇ and Stuart Campoᵇ
ᵃ College of Science and Engineering, Hamad Bin Khalifa University, Qatar
ᵇ United Nations Office for the Coordination of Humanitarian Affairs (OCHA), Centre for Humanitarian Data, Netherlands
Abstract
We have recently seen a rapid growth of data portals and dataset repositories being made available
on the Web. While these repositories have been critical for advancing research, much work remains to
make it easier to find appropriate datasets and relevant sources. Search engines, the primary tools for dataset
discovery, are mainly keyword-based over published metadata of the datasets, whether within dataset
repositories or over the Web. However, in most cases, the available metadata may not encompass the
essential information the user needs to decide whether the dataset fits a given task. Therefore, data
publishers should annotate their datasets with informative metadata when they add them to a dataset
repository. Tags are a particular form of metadata that the data publisher uses to describe their view
of how the dataset should be categorized. An interesting problem is how to automate the process of
recommending tags to data publishers when they add new data to a dataset repository. In this paper, we
develop an approach for automatic tag recommendation for dataset repositories. We investigate how
to exploit the features of the dataset and the tagging history in the repository to build an effective tag
recommendation model. We further demonstrate the integration of the model into the Humanitarian
Data Exchange, a real-world dataset repository in the social and humanitarian domains.
Keywords
Dataset Repository, Dataset Tagging, Keyword Search, Tag Recommendation
1. Introduction
Nowadays, many dataset repositories and data portals are created by different organizations to
facilitate sharing and distribution of datasets. Online platforms like CKAN,¹ Quandl, Kaggle,² and
Microsoft Azure Marketplace³ are examples of dataset repositories that host datasets for
data-driven research in a wide range of domains. The data in these repositories is usually
tabular (e.g., CSV files), and the goal of the repositories is to enable data scientists to find, access,
integrate, and analyze combinations of datasets based on their needs. The first step in this
process is to find the datasets relevant to a task, which requires information retrieval. Currently,
dataset repositories use search engines that were mainly developed for unstructured textual
documents. To improve retrieval quality, dataset repositories typically allow data publishers
BIRDS 2021: Bridging the Gap between Information Science, Information Retrieval and Data Science, March 19, 2021, online
gabuoda@hbku.edu.qa (G. Abuoda); hendrix@un.org (C. Hendrix); campo2@un.org (S. Campo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://ckan.org/
² https://www.quandl.com/
³ https://azuremarketplace.microsoft.com/en-us/marketplace/
4