A sequential algorithm for sparse support vector classifiers

Jian-Xun Peng*, Stuart Ferguson, Karen Rafferty, Victoria Stewart
The School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT9 5AH, UK

Article history: Received 19 October 2011; received in revised form 2 October 2012; accepted 8 October 2012; available online 16 October 2012.

Keywords: Support vector classifier; Sequential algorithm; Sparse design

Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring high classification speed or when deployed in systems with limited computational resources, due to the large number of support vectors involved in the model. To overcome this problem we have devised a primal SVM method with the following properties: (1) it solves for the SVM representation without the need to invoke the representer theorem; (2) forward and backward selections are combined to approach the final globally optimal solution; and (3) a criterion is introduced for the identification of support vectors, leading to a much reduced support vector set. In addition to introducing this method, the paper analyzes the complexity of the algorithm and presents test results on three public benchmark problems and a human activity recognition application. These applications demonstrate the effectiveness and efficiency of the proposed algorithm.

© 2012 Published by Elsevier Ltd.

1. Introduction

The support vector machine (SVM) is a class of learning systems that delivers state-of-the-art performance in real-world data classification problems such as image classification [31,30], character recognition [8,33], and text categorization [16]. SVM classification techniques are firmly grounded in the framework of VC theory for binary classification problems proposed by Vapnik and Chervonenkis [37]. Conceptually, an SVM maps input vectors into a so-called feature space, which is of higher dimension than the input space.
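The effect of such a feature-space mapping can be made concrete with a small sketch. The example below is illustrative only (the data, the degree-2 polynomial feature map `phi`, and the weight vector `w` are hypothetical choices, not part of the paper): four XOR-labelled points that no line can separate in the 2-D input space become linearly separable after mapping into a 3-D feature space.

```python
import math

# Hypothetical illustration: four XOR-labelled points, not linearly
# separable in the 2-D input space.
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [+1, -1, -1, +1]  # XOR labelling

def phi(x):
    """Degree-2 polynomial feature map: 2-D input -> 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# In the feature space the hyperplane w.z = 0 with w = (0, 1, 0) separates
# the two classes, since sign(sqrt(2) * x1 * x2) equals the XOR label.
w = (0.0, 1.0, 0.0)
preds = [1 if sum(wi * zi for wi, zi in zip(w, phi(x))) > 0 else -1 for x in X]
print(preds)  # -> [1, -1, -1, 1]
```

In practice the mapping is never computed explicitly; kernels evaluate the feature-space inner product directly, which is what makes very high-dimensional feature spaces tractable.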
In the feature space, a linear decision surface is constructed based on the structural risk minimization (SRM) principle, i.e. minimizing an upper bound on the expected risk, the expectation of the test error of an SVM on an unseen point [38]. An SVM implements the SRM principle by maximizing the margin while minimizing the error. The margin defined by an SVM refers to the distance between the two parallel hyperplanes in the feature space that bound the training points of the two classes, respectively. The SRM principle implemented by the SVM overcomes the difficulties with generalization that traditional neural networks have suffered from.

Even though they yield very accurate solutions, SVMs are not preferred in real-time applications, especially for systems with limited computational resources (e.g. available RAM and CPU resources). This is because a large set of support vectors (SVs) is usually needed to form the SVM classifier, making an SVM computationally complex and expensive to implement. The number of SVs of an SVM, denoted n_SV, determines its complexity. Steinwart [35] showed theoretically that n_SV grows linearly with the number of training points N in probability. This reveals that, for large problems, n_SV can be large, and thus the training and testing complexities might become prohibitive in practice since they are, respectively, O(n_SV N + n_SV^3) and O(n_SV). For these reasons, there has been increasing interest in seeking sparse (approximate) representations of standard (accurate) SVMs.

Techniques proposed for reducing the size of an SVM classifier (measured by the number of support vectors) fall into two classes: post-training algorithms and algorithms that yield sparse SVMs directly (referred to as sparse SVM algorithms). Post-training algorithms essentially approximate the normal vector to the separating hyperplane of a standard SVM, which is expressed as a linear expansion over the support vectors in the feature space.
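The O(n_SV) testing cost can be seen directly from how a kernel-expansion classifier is evaluated. The sketch below is a generic illustration, not the paper's algorithm; the RBF kernel, the toy support vectors, and the coefficient values are all assumptions chosen for the example.

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, coeffs, bias, kernel=rbf):
    """Evaluate f(x) = sum_i c_i * K(s_i, x) + b, where c_i = alpha_i * y_i.

    The loop visits every support vector once, so classifying one point
    costs one kernel evaluation per support vector: O(n_SV).
    """
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, coeffs)) + bias

# Toy model with two support vectors (values are illustrative only).
svs = [(0.0, 0.0), (2.0, 2.0)]
coeffs = [1.0, -1.0]  # alpha_i * y_i
label = 1 if svm_decision((0.1, 0.0), svs, coeffs, 0.0) > 0 else -1
print(label)  # -> 1
```

Halving n_SV halves this per-point cost, which is exactly why sparse representations matter for real-time deployment.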
Linear expansions of smaller size are then used to approximate this normal vector by minimizing the Euclidean distance between the approximating normal and the original one. The approximated SVM discriminant function is thus expressed as the inner product of the approximating normal and an input vector in the feature space. For example, Burges and Schölkopf [3] apply nonlinear optimization methods to seek a reduced set of vectors and the corresponding expansion coefficients for sparse representations of standard SVM classifiers. However, this method does not work when a further reduction of the support vector set is desired. Li et al. [20] constructed a reduced set by selecting from the support vector solution one element at a time, based on the vector correlation principle and a greedy strategy. However, this greedy search does not guarantee local optimality. Along similar lines, Schölkopf et al. [32] discussed the connection between feature space and input space by dealing with the

Pattern Recognition 46 (2013) 1195–1208
http://dx.doi.org/10.1016/j.patcog.2012.10.007
* Corresponding author. Tel.: +44 28 90974480. E-mail address: j.peng@qub.ac.uk (J.-X. Peng).
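The reduced-set idea behind these post-training methods can be sketched in code. The following is a simplified, hypothetical illustration (not the algorithm of [3], [20], or this paper): it picks a single vector z and coefficient beta so that beta*phi(z) best approximates the original normal Psi = sum_i alpha_i*phi(x_i) in the feature space, using the closed-form optimum for beta at fixed z.

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def best_single_vector(svs, alphas, candidates, kernel=rbf):
    """Select one vector z and coefficient beta so that beta*phi(z) best
    approximates Psi = sum_i alpha_i * phi(x_i) in the feature space.

    For fixed z, minimizing ||Psi - beta*phi(z)||^2 over beta gives
    beta = (k_z . alpha) / K(z, z), with k_z = (K(z, x_1), ..., K(z, x_n)),
    and the squared residual shrinks by (k_z . alpha)^2 / K(z, z); the
    candidate maximizing that gain is therefore chosen.
    """
    best_gain, best_z, best_beta = -1.0, None, 0.0
    for z in candidates:
        kz_alpha = sum(a * kernel(z, x) for a, x in zip(alphas, svs))
        gain = kz_alpha ** 2 / kernel(z, z)
        if gain > best_gain:
            best_gain, best_z, best_beta = gain, z, kz_alpha / kernel(z, z)
    return best_z, best_beta

# Toy expansion with a single support vector: the best one-vector
# approximation is that support vector itself, with beta = alpha.
z, beta = best_single_vector([(0.0, 0.0)], [2.0], [(0.0, 0.0), (5.0, 5.0)])
print(z, beta)  # -> (0.0, 0.0) 2.0
```

A full reduced-set method repeats such a step (or optimizes all z and beta jointly, as in [3]) until the expansion reaches the desired size; the greedy one-at-a-time variant mirrors the strategy attributed to Li et al. [20], and, as noted above, carries no optimality guarantee.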