VOKNN: Voting-based Nearest Neighbor Approach for Scalable SVM Training

Saeed Salem
Department of Computer Science, North Dakota State University, Fargo, ND 58102, USA
saeed.salem@ndsu.edu

Khedidja Seridi
INRIA Dolphin Project Opac, LIFL CNRS, Villeneuve d'Ascq, France
khedidja.seridi@inria.fr

Loqmane Seridi
Math. and Comp. Sci. & Eng. Division, King Abdullah Uni. of Sci. and Tech., Thuwal, Saudi Arabia 23955-6900
loqmane.seridi@kaust.edu.sa

Jianfei Wu
Department of Computer Science, North Dakota State University, Fargo, ND 58102, USA
jianfei.wu@ndsu.edu

Mohammed J. Zaki
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
zaki@cs.rpi.edu

Title-page footnotes: Corresponding Author. This author contributed to this work while visiting Rensselaer Polytechnic Institute.

ABSTRACT
Support Vector Machines (SVMs) are a powerful classification technique, but for large datasets SVM training is computationally expensive. In this paper, we propose VOKNN, a novel nearest-neighbor-based voting data reduction algorithm for SVM training. The proposed approach eliminates part of the training dataset based on the votes each point gains from the points in the other class. We demonstrate the effectiveness and efficiency of VOKNN on several real datasets. SVM classification models built on the reduced datasets achieve classification accuracy comparable to those built on the original training datasets. In a few cases, SVM classification accuracy improves significantly when the SVM is trained on a reduced training dataset.

Categories and Subject Descriptors
H.2.8 [Data mining]: Support Vector Machines, Kernel Nearest Neighbors

1. INTRODUCTION & RELATED WORK
Support vector machines (SVMs) [13] have attracted significant attention from the research community, thanks in part to their classification power; SVM was recently voted one of the top 10 most influential data mining algorithms [15]. The classification power of SVM can be attributed to its excellent generalization performance [3]. One of the key features of SVM is its ability to learn non-linear decision functions. It does so by mapping the data from the original space, L, into a higher-dimensional Euclidean space, H [3]. The SVM then tries to find a linear separation between the classes in the new feature space. SVM uses a kernel function to compute the dot product of points in the new space without having to map every point to its image in that space. Define a mapping function $\Phi : L \rightarrow H$. For every two points $x_i$ and $x_j$ in the original space, the kernel trick allows the dot product in the new space to be computed in terms of the coordinates of the points in the original space, i.e., $\langle \Phi(x_i), \Phi(x_j) \rangle = f(x_i, x_j)$. The training step of SVM involves solving a large quadratic programming (QP) optimization problem, which is computationally expensive, especially for large training datasets.
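To make the kernel trick concrete, the following Python sketch (illustrative only, and not part of the paper's method; the helper names phi and poly_kernel are ours) verifies for the degree-2 polynomial kernel $f(x_i, x_j) = (x_i \cdot x_j)^2$ that the kernel value computed in the original 2-dimensional space equals the explicit dot product $\langle \Phi(x_i), \Phi(x_j) \rangle$ in the 3-dimensional feature space.

import numpy as np

def phi(x):
    # Explicit degree-2 polynomial feature map for a 2-d point:
    # Phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), chosen so that
    # <Phi(xi), Phi(xj)> = (xi . xj)^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(xi, xj):
    # Kernel f(xi, xj) = (xi . xj)^2, computed entirely in the
    # original space -- the image under Phi is never materialized.
    return np.dot(xi, xj) ** 2

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

print(np.dot(phi(xi), phi(xj)))  # 16.0  (dot product in H)
print(poly_kernel(xi, xj))       # 16.0  (same value via the kernel)

Since the solver only ever needs such pairwise kernel values, training works with the $n \times n$ Gram matrix over the $n$ training points, which is the source of the quadratic training cost discussed next.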
The computational complexity of SVM training is $O(n^2)$, where $n$ is the number of training points [3, 14]. Several methods have been proposed to speed up SVM training; many of them aim at speeding up the solution of the quadratic programming problem. The chunking algorithm [11] works on a subset (chunk) of the dataset at a time: it starts with a random subset of the data and, at each iteration, keeps all the examples with non-zero Lagrange multipliers and adds the examples that violate the optimality conditions. Another method for speeding up SVM training is Sequential Minimal Optimization (SMO), which breaks the large QP optimization problem into smaller, more manageable QP problems [10]. The scalability