Representative Subsets For Big Data Learning using k-NN Graphs

Raghvendra Mall, Vilen Jumutc, Rocco Langone, Johan A.K. Suykens
KU Leuven, ESAT/STADIUS, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
{raghvendra.mall,vilen.jumutc,rocco.langone,johan.suykens}@esat.kuleuven.be

Abstract—In this paper we propose a deterministic method to obtain subsets from big data which are a good representative of the inherent structure in the data. We first convert the large scale dataset into a sparse undirected k-NN graph using a distributed network generation framework that we propose in this paper. After obtaining the k-NN graph we exploit the fast and unique representative subset (FURS) selection method [1], [2] to deterministically obtain a subset of this big data network. The FURS selection technique selects nodes from different dense regions in the graph, retaining the natural community structure. We then locate the points in the original big data corresponding to the selected nodes and compare the obtained subset with subsets acquired from state-of-the-art subset selection techniques. We evaluate the quality of the selected subset on several synthetic and real-life datasets for different learning tasks, including big data classification and big data clustering.

I. INTRODUCTION

In the modern era, with the advent of new technologies and their widespread usage, there is a huge proliferation of data. This immense wealth of data has resulted in massive datasets and has led to the emergence of the concept of Big Data. However, the choices for selecting a predictive model for Big Data learning are limited, as only a few tools scale to large scale datasets. One direction is to develop efficient learning algorithms which are fast, scalable and might use parallelization or distributed computing. Recently, a tool named Mahout [3] (http://www.manning.com/owen/) was built which implements several machine learning techniques for big data using the distributed Hadoop [5] framework.
The other direction is sampling [6], [7]. There are several machine learning algorithms which build predictive models on a small representative subset of the data [2], [8]–[13] and have out-of-sample extension properties. This property allows inference for the previously unseen part of the large scale data. The methods which belong to this class include kernel based methods, similarity based methods, prototype learning methods, instance based methods, manifold learning, etc.

Sampling [14] is concerned with the selection of points as a subset which can be used to estimate characteristics of the whole dataset. The main disadvantage of probabilistic sampling techniques is that every time the algorithm runs, different subsets are obtained. This often results in large variations in performance. Another disadvantage is that most probabilistic sampling techniques cannot capture certain characteristics of the data, like the inherent cluster structure, unless the cluster information is available in advance. However, in the case of real-life datasets this information is not known beforehand and is learnt by unsupervised learning techniques. In this paper we propose a framework to overcome these problems and select representative subsets that retain the natural cluster structure present in the data.

We first convert the big data into an undirected and weighted k-Nearest Neighbor (k-NN) [15], [16] graph, where each node represents a data point and each edge represents the similarity between the data points. In this paper we propose a distributed environment to convert big data into this k-NN graph. After obtaining the k-NN graph, we use the fast and unique representative subset (FURS) selection technique proposed in [1] and [2]. We propose a simple extension of the FURS method to handle the case of weighted graphs. FURS selects nodes from different dense regions in the graph while retaining the inherent community structure. Finally, we map these selected nodes to the points in the original data.
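The idea of picking nodes from distinct dense regions can be illustrated with a minimal sketch: greedily select the active node with the highest weighted degree, then deactivate its neighbourhood so the next pick comes from a different dense region, reactivating the remaining nodes once all become inactive. This is only a simplified illustration in the spirit of FURS; the actual algorithm and its weighted extension are specified in [1], [2].

```python
def greedy_subset(adj, subset_size):
    """Greedy degree-based selection sketch.

    adj: {node: {neighbour: weight}} adjacency of an undirected
    weighted graph (e.g. the weighted k-NN graph).
    """
    selected = []
    active = set(adj)
    while len(selected) < subset_size and active:
        # weighted degree of each currently active node
        degree = {v: sum(adj[v].values()) for v in active}
        best = max(degree, key=degree.get)
        selected.append(best)
        active.discard(best)
        active -= set(adj[best])  # deactivate its neighbourhood
        if not active:            # reactivate the remaining nodes
            active = set(adj) - set(selected)
    return selected
```

On a graph with two separated dense regions, the second selected node comes from the region not covered by the first pick, which is the behaviour the subset selection relies on.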
These points capture the intrinsic cluster structure present in the data. We compare and evaluate the resulting subset against other sampling techniques like simple random sampling [6], stratified random sampling [7] and a subset selection technique based on maximizing the Rényi entropy criterion [17] and [8]. For classification, we use the subset to build a subsampled-dual least squares support vector machine (SD-LSSVM) model as proposed in [10] and use the out-of-sample extension property to determine the class labels for points in the big data. For clustering, we utilize the kernel spectral clustering (KSC) method proposed in [11]. We build the training model on the subset and again use the out-of-sample extension property of the model to infer cluster affiliation for the entire dataset. Figure 1 represents the flow chart of the steps undertaken.

II. DISTRIBUTED k-NN GRAPH GENERATION FRAMEWORK

In this section we describe a parallel approach for network generation from the kernel matrix. The kernel matrix is in general a full matrix, and a full graph can be generated corresponding to it. However, most real-life datasets have underlying sparsity, i.e. each point in the dataset is similar to only a few other points in the big data. Hence, we propose to use the k-NN graph [15], [16] to represent the big data. We now present a resilient way of handling big and massive datasets by sequencing and distributing computations in a smart way.
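The blockwise idea can be sketched as follows: compute the kernel matrix one block of rows at a time, keep only the k most similar neighbours of each row, and discard the rest, so the full N x N matrix never resides in memory and the row blocks can be processed by independent workers. This is a minimal single-machine sketch, not the authors' distributed implementation; the RBF kernel, the bandwidth `sigma` and the block size are illustrative assumptions.

```python
import numpy as np

def knn_graph(X, k, sigma=1.0, block=256):
    """Sparse k-NN graph from an RBF kernel, computed in row blocks."""
    n = X.shape[0]
    nbr_idx = np.empty((n, k), dtype=int)   # neighbour indices per point
    nbr_sim = np.empty((n, k))              # corresponding edge weights
    sq = (X ** 2).sum(axis=1)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # squared distances of this block of rows to all points
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * X[start:stop] @ X.T
        sim = np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
        np.fill_diagonal(sim[:, start:stop], -np.inf)  # exclude self-loops
        top = np.argsort(-sim, axis=1)[:, :k]          # k most similar points
        nbr_idx[start:stop] = top
        nbr_sim[start:stop] = np.take_along_axis(sim, top, axis=1)
    return nbr_idx, nbr_sim
```

Since each block depends only on its own rows of X (plus the full data for the column side), the loop body is embarrassingly parallel and could be dispatched to separate machines, with the per-row top-k lists merged into the final sparse graph.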