Optimal Reduced Set for Sparse Kernel Spectral Clustering

Raghvendra Mall, Siamak Mehrkanoon, Rocco Langone and Johan A.K. Suykens

Abstract— Kernel Spectral Clustering (KSC) solves a weighted kernel principal component analysis problem in a primal-dual optimization framework, resulting in a clustering model expressed in terms of the dual solution of the problem. It has a powerful out-of-sample extension property leading to good clustering generalization on unseen data points. The out-of-sample extension property allows us to build a sparse model on a small training set, which introduces the first level of sparsity. The clustering dual model, however, is expressed in terms of non-sparse kernel expansions to which every point in the training set contributes. The goal is to find a reduced set of training points which best approximates the original solution. In this paper a second level of sparsity is introduced in order to reduce the time complexity of the computationally expensive out-of-sample extension. We investigate several penalty-based reduced-set techniques, including the Group Lasso, L0 and L1 + L0 penalizations, and compare the amount of sparsity gained with that of a previous L1 penalization technique. We observe that the optimal results in terms of sparsity correspond to the Group Lasso penalization technique in the majority of cases. We showcase the effectiveness of the proposed approaches on several real-world datasets and an image segmentation dataset.

I. INTRODUCTION

Clustering algorithms are widely used tools in fields like data mining, machine learning, graph compression and many other tasks. The aim of clustering is to divide data into the natural groups present in a given dataset. Clusters are defined such that the data within a group are more similar to each other than to the data in other clusters. Spectral clustering methods [1], [2] and [3] are generally better than traditional k-means techniques.
A new Kernel Spectral Clustering (KSC) algorithm based on a weighted kernel PCA formulation was proposed in [4]. The method is based on a model built in a primal-dual optimization framework. The model has a powerful out-of-sample extension property which allows cluster affiliation to be inferred for unseen data. The KSC methodology has been extensively applied to the task of data clustering [4], [5], [6], [7] and to community detection [8], [9], [10] in large-scale networks. The data points are projected to the eigenspace and the projections are expressed in terms of non-sparse kernel expansions. In [5], a method to sparsify the clustering model was proposed by exploiting the line structure of the projections when the clusters are well formed and well separated. However, the method fails when the clusters are overlapping, and for real-world datasets the projections in the eigenspace do not follow a line structure, as mentioned in [6]. In [6], the authors used an L2 + L1 penalization to produce a reduced set approximating the original solution vector. Although the authors propose it as an L2 + L1 penalization technique, the actual penalty on the weight vectors is an L1 penalty and the loss function is the squared loss, hence the name. Therefore, in this paper we refer to this previously proposed approach as the L1 penalization technique. It is well known that L1 regularization introduces sparsity, as shown in [11]. However, the resulting reduced set is neither the sparsest nor optimal w.r.t. the quality of clustering for the entire dataset. In this paper we propose to use alternative penalization techniques, namely the Group Lasso [12], [13], and the L0 and L1 + L0 penalizations.

Raghvendra Mall, Siamak Mehrkanoon, Rocco Langone and Johan A.K. Suykens are with the Department of Electrical Engineering, Katholieke Universiteit Leuven (email: {raghvendra.mall, rocco.langone, siamak.mehrkanoon, johan.suykens}@esat.kuleuven.be).
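To make the sparsification mechanism concrete, the sketch below is a minimal illustrative example (not the authors' implementation): it approximates a dense kernel expansion Ω·α by a sparse coefficient vector β via proximal gradient descent (ISTA) on 0.5·||Ω(α − β)||² + λ||β||₁. The helper names `soft_threshold`, `block_soft_threshold` and `l1_reduced_set`, the RBF kernel, and all parameter values are assumptions chosen for the toy setting.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def block_soft_threshold(V, t):
    # Proximal operator of the Group Lasso penalty: shrink each row (group)
    # by t in Euclidean norm; groups whose norm is below t vanish entirely.
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def l1_reduced_set(Omega, alpha, lam=0.5, n_iter=500):
    # Approximate the dense kernel expansion Omega @ alpha with a sparse beta
    # by ISTA on 0.5 * ||Omega @ (alpha - beta)||^2 + lam * ||beta||_1.
    target = Omega @ alpha
    L = np.linalg.norm(Omega, 2) ** 2      # Lipschitz constant of the gradient
    beta = np.zeros_like(alpha)
    for _ in range(n_iter):
        grad = Omega.T @ (Omega @ beta - target)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

if __name__ == "__main__":
    # Toy RBF kernel matrix on 40 one-dimensional points.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 1))
    Omega = np.exp(-np.square(X - X.T))
    alpha = rng.normal(size=40)            # dense "dual" coefficients
    beta = l1_reduced_set(Omega, alpha, lam=5.0)
    print("reduced set size:", np.count_nonzero(beta), "of", beta.size)
```

The surviving non-zero entries of β play the role of a reduced set. Replacing `soft_threshold` with `block_soft_threshold` applied to the matrix of coefficients across the k − 1 score variables gives a Group Lasso variant, which discards whole training points (rows) at once rather than individual coefficients.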
The Group Lasso penalty is ideal for clusters as it results in the selection of groups of relevant data points. The L0 regularization counts the number of non-zero terms in a vector; the L0-norm results in a non-convex and NP-hard optimization problem. We modify the convex relaxation of the L0-norm based iterative sparsification procedure introduced in [14] for classification, and apply it to obtain reduced sets for sparse kernel spectral clustering. The main advantage of these sparse reductions is that they result in much simpler and faster predictive models: they reduce the time complexity of the computationally expensive out-of-sample extension and also reduce the memory requirements for building the test kernel matrix.

II. KERNEL SPECTRAL CLUSTERING

We first provide a brief description of the kernel spectral clustering methodology according to [4].

A. Primal-Dual Weighted Kernel PCA framework

Given a dataset $\mathcal{D} = \{x_i\}_{i=1}^{N_{tr}}$, $x_i \in \mathbb{R}^d$, the training points are selected by maximizing the quadratic Rényi criterion as depicted in [6], [15] and [18]. This introduces the first level of sparsity by building the model on a subset of the dataset. Here $x_i$ represents the $i$-th training data point and $N_{tr}$ is the number of data points in the training set. Given $\mathcal{D}$ and the number of clusters $k$, the primal problem of spectral clustering via weighted kernel PCA is formulated as follows [4]:

$$\min_{w^{(l)}, e^{(l)}, b_l} \; \frac{1}{2} \sum_{l=1}^{k-1} w^{(l)\top} w^{(l)} - \frac{1}{2N_{tr}} \sum_{l=1}^{k-1} \gamma_l\, e^{(l)\top} D_{\Omega}^{-1} e^{(l)}$$
$$\text{such that } e^{(l)} = \Phi w^{(l)} + b_l 1_{N_{tr}}, \quad l = 1, \dots, k-1, \qquad (1)$$

where $e^{(l)} = [e^{(l)}_1, \dots, e^{(l)}_{N_{tr}}]^\top$ are the projections onto the eigenspace, $l = 1, \dots, k-1$ indicates the number of score variables required to encode the $k$ clusters, $D_{\Omega}^{-1} \in \mathbb{R}^{N_{tr} \times N_{tr}}$