EXOD: a tool for building and exploring a large graph of open datasets

Tianyang Liu (a), Fatma Bouali (b, a), Gilles Venturini (a)

(a) University François Rabelais of Tours, France
(b) University of Lille 2, France

Abstract

We present a tool called EXOD (“EXploration of Open Datasets”) for the visual analysis of a large collection of open datasets. EXOD aims to help users find datasets of interest. It starts by downloading a large collection of datasets from an Open Data web site and, for each dataset, extracts its metadata and its content. To describe each dataset in a vector space, EXOD extracts features with text mining techniques, taking into account both the metadata and the content of the dataset. In this feature space, EXOD builds a proximity graph by computing the Relative Neighborhood Graph. Given the size of the collection, EXOD uses a GPU-based implementation to build this graph. We visualize the graph with the Tulip software and provide a visual and interactive global map of the collection. We developed a specific plug-in for Tulip to download and open the datasets interactively. All of the presented results concern the French Open Data. EXOD was able to process 293,000 datasets, and half of this collection was visualized in Tulip. We show how clusters and other information can be discovered and how the created links can be used for local and content-based exploration.

Keywords: Open data mining, Proximity graphs, Graph interactive visualization

1. Introduction

Many countries have recently released Open Data. Such data cover many aspects of the life of citizens, as can be observed from the web sites that deliver them (see Table 1). The topics range from taxes to immigration, from entertainment to crime, and many other subjects.
An Open dataset generally consists of metadata, explaining its source and aims, and a data file, which contains one or more tables with numbers and additional information (see an example in Figure 2). A huge amount of information is thus published online by governments and other data providers. In the case of the French Open Data, over 350,000 datasets are publicly available.

In general, governmental Open Data web sites offer users only limited search mechanisms based on traditional search engines. Few or no visual or interactive possibilities are provided. With such access to information, and given the large number of available open datasets, users may have difficulty finding a dataset of interest. The context of this study thus concerns the help that can be provided to users to better explore a large collection of open datasets.

When exploring a dataset, users should obtain an overview first and then details on demand [1]. Therefore, our first objective is to provide an overview of a collection of open datasets. This overview should be visual and interactive. It should reveal overall information about the collection, such as the presence of clusters, the relations between these clusters, the topics they deal with, their size, the outliers, etc. Such a global map could be useful for users who wish to analyze the collection globally and to find datasets dealing with desired topics.

A second objective of our work is to provide details about clusters and datasets and to make suggestions to users for the local and detailed exploration of datasets. When observing a dataset, users could be interested in exploring other datasets with similar content. To do so, search engines should be able to make suggestions to the users.
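As a rough illustration of how links between datasets with similar content could be derived, the sketch below builds a Relative Neighborhood Graph (the proximity graph named in the abstract) by brute force on the CPU. It is not EXOD's GPU implementation: the Euclidean distance, the function name `relative_neighborhood_graph`, and the toy 2-D feature vectors are assumptions made purely for illustration.

```python
from itertools import combinations
from math import dist


def relative_neighborhood_graph(points):
    """Return the RNG edges of `points` as a list of index pairs (i, j), i < j.

    Brute-force O(n^3) check of the definition: i and j are linked unless
    some third point k is strictly closer to both of them than they are
    to each other.
    """
    n = len(points)
    # Pairwise Euclidean distances (assumed metric, for illustration only).
    d = [[dist(p, q) for q in points] for p in points]
    edges = []
    for i, j in combinations(range(n), 2):
        blocked = any(
            max(d[i][k], d[j][k]) < d[i][j]
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges.append((i, j))
    return edges


# Hypothetical 2-D feature vectors standing in for dataset descriptions.
toy_features = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 5.0)]
toy_edges = relative_neighborhood_graph(toy_features)
print(toy_edges)
```

On this toy input the edges are (0, 1), (1, 2) and (1, 3): the pair (0, 2) is not linked because point 1 lies between them, and the distant point 3 is attached only to its nearest neighbour. The neighbours of a dataset in such a graph could then serve as content-based suggestions.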
As will be seen in the next section, Open Data mining is a recent issue, and therefore there are no specific visual and interactive approaches that could be directly used to achieve our objectives. Instead, several techniques from different domains are relevant to the problem we address. We first need to extract information and features from the open datasets in order to represent them in a feature space. We must also determine how to build a global view and how to create links between datasets according to their similarity. We also need an interactive user interface that can represent a global map, support interactions, and provide details about a given dataset. Finally, the size of the collection (several hundred thousand datasets) has an important impact on these choices. Therefore, we drew on several approaches in data processing, text mining, topological learning, parallel programming on multi-core architectures and graph visualization to produce our tool, called EXOD.

The main contributions of our work are thus the following:

• we adapt and combine existing techniques to the context of Open Data mining,
• we propose a GPU-based implementation of a proximity graph building method,
• we perform several tests with real and large open datasets.

Preprint submitted to Computers & Graphics, December 7, 2014

The remainder of this paper is organized as follows: sec-