Automatic Tag Recommendation for the UN Humanitarian Data Exchange

Ghadeer Abuodaᵃ, Chad Hendrixᵇ and Stuart Campoᵇ

ᵃ College of Science and Engineering, Hamad Bin Khalifa University, Qatar
ᵇ United Nations Office for the Coordination of Humanitarian Affairs (OCHA), Centre for Humanitarian Data, Netherlands

Abstract

Recent years have seen rapid growth in the number of data portals and dataset repositories available on the Web. While these repositories have been critical for advancing research, much work remains to make appropriate datasets and relevant sources easier to find. Search engines, the primary tools for dataset discovery, are mainly keyword-based and operate over the published metadata of the datasets, whether within dataset repositories or over the Web. However, in most cases, the available metadata does not include the essential information the user needs to decide whether a dataset fits a given task. Therefore, data publishers should annotate their datasets with informative metadata when they add them to a dataset repository. Tags are a particular form of metadata that data publishers use to describe their view of how a dataset should be categorized. An interesting problem is how to automate the process of recommending tags to data publishers when they add new data to a dataset repository. In this paper, we develop an approach for automatic tag recommendation for dataset repositories. We investigate how to exploit the features of the dataset and the tagging history in the repository to build an effective tag recommendation model. We further demonstrate the integration of the model in the Humanitarian Data Exchange, a real-world dataset repository in the social and humanitarian domains.

Keywords

Dataset Repository, Dataset Tagging, Keyword Search, Tag Recommendation

1. Introduction

Nowadays, many dataset repositories and data portals are created by different organizations to facilitate the sharing and distribution of datasets.
Online platforms like CKAN,¹ Quandl,² Kaggle, and Microsoft Azure Marketplace³ are examples of dataset repositories that host datasets for data-driven research in a wide range of domains. The data in these repositories is usually tabular (e.g., CSV files), and the goal of the repositories is to enable data scientists to find, access, integrate, and analyze combinations of datasets based on their needs. The first step in this process is to find the datasets relevant to a task, which is an information retrieval problem. Currently, dataset repositories use search engines that were mainly developed for unstructured textual documents. To improve retrieval quality, dataset repositories typically allow data publishers

BIRDS 2021: Bridging the Gap between Information Science, Information Retrieval and Data Science, March 19, 2021, online
gabuoda@hbku.edu.qa (G. Abuoda); hendrix@un.org (C. Hendrix); campo2@un.org (S. Campo)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 https://ckan.org/
2 https://www.quandl.com/
3 https://azuremarketplace.microsoft.com/en-us/marketplace/
4