IEEE Transactions on Parallel and Distributed Systems, DOI 10.1109/TPDS.2017.2731764, (c) 2017 IEEE.

A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training

Jan Vaněk, Josef Michálek, and Josef Psutka

Abstract—In the last decade, several GPU implementations of Support Vector Machine (SVM) training with nonlinear kernels have been published, some of them with source code. The most effective ones are based on Sequential Minimal Optimization (SMO): they decompose the constrained quadratic problem into a series of smallest-possible subproblems, which are then solved analytically. For large datasets, the majority of the elapsed time is spent on a large number of matrix-vector multiplications, which cannot be computed efficiently on current GPUs because of their limited memory bandwidth. In this paper, we introduce a novel GPU approach to SVM training that we call Optimized Hierarchical Decomposition SVM (OHD-SVM). It uses a hierarchical decomposition iterative algorithm that better fits the actual GPU architecture. The low decomposition level uses a single GPU multiprocessor to efficiently solve a local subproblem. Nowadays, a single GPU multiprocessor can run a thousand or more threads that are able to synchronize quickly, making it an ideal platform for a single-kernel SMO-based local solver with fast local iterations. The high decomposition level updates the gradients of the entire training set and selects a new local working set. The gradient update requires many kernel values that are costly to compute.
However, solving a large local subproblem allows the kernel values to be computed efficiently via a matrix-matrix multiplication, which is much faster than the matrix-vector multiplication used in previously published implementations. Along with a description of our implementation, the paper includes an exact comparison of five publicly available C++ GPU implementations of SVM training. As is usual in most recent papers, we consider the binary classification task and the RBF kernel function. According to results measured on a wide set of publicly available datasets, our proposed approach significantly outperformed the other methods on all datasets. The biggest difference was on the largest dataset, where we achieved a speed-up of up to 12 times over the fastest previously published GPU implementation. Moreover, our OHD-SVM is the only implementation that can handle dense as well as sparse datasets. Along with this paper, we published the source code at https://github.com/OrcusCZ/OHD-SVM.

Index Terms—Support Vector Machines, SVM Training, GPU, CUDA, Optimization

1 INTRODUCTION

Support Vector Machines (SVMs) are popular general-purpose learning methods. They offer good generalization ability by maximizing the margin, which is controlled by a manually set regularization constant. SVMs can also handle a wide variety of problems thanks to a user-defined kernel function. SVMs were originally developed for binary classification, but multi-class variants are also possible. Despite the boom of artificial neural networks, SVMs are still used in many domains, for example in economics (customer churn prediction and marketing retention strategies [1], credit scoring [2]), landscape surveys (landslide susceptibility mapping [3], endangered tree species mapping [4]), medicine (breast cancer mammography recognition [5]), and chemistry and biotechnology [6].
In recent years, several new variants of SVMs have been introduced: Semi-Supervised SVMs [7], Twin SVMs [8], Generalized Eigenvalue Proximal SVMs [9], and Nonparallel SVMs [10]. Training an SVM amounts to solving a quadratic programming problem. A good overview of optimization techniques can be found in [11]. Very efficient solutions have been developed especially for linear or linearized SVMs [12], [13], [14], [15]. Nonlinear SVM solvers are mostly based on a decomposition technique in the dual formulation of the SVM criterion. The most frequent approach is SMO with a subset of two components, which has a simple analytical solution, introduced by Platt in [16]. The two-component SMO was generalized to a three-component SMO by Lin in [17]. In contrast, Joachims solves small subproblems by Cholesky factorization [18]. A decomposition technique with gradient projection on the subproblems was proposed by Zanni in [19]. Platt's SMO was further improved by Keerthi in [20]. Fan implemented LibSVM, which is based on Keerthi's improved SMO [21], and it is still used as a reference thanks to its robust working-set heuristic, kernel caching, and shrinking technique. However, large SVM problems require high-performance implementations to train a model in a reasonable time. One option for large dense datasets is to compute the kernel function via CPU-optimized Intel or AMD libraries, which also have multi-core support. More advanced multi-core and multi-node CPU implementations were described by Elad, Cao, Goncalves, and You in [22], [23], [24], and [25], respectively. Dong in [26] proposed an approach based on a reduced block-diagonal Gram matrix, and Graf in [27] proposed an SVM cascade that has similar behavior: faster elimination of non-support vectors.

The authors are with the University of West Bohemia, New Technologies for the Information Society, Pilsen, Czech Republic. E-mail: {vanekyj, orcus, psutka}@ntis.zcu.cz
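For reference, the quadratic programming problem that the decomposition methods above attack is the standard SVM dual (this formulation is textbook material, not taken verbatim from this paper):

\begin{align}
\max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{s.t.} \quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, n.
\end{align}

SMO's smallest-possible subproblem optimizes a single pair $(\alpha_i, \alpha_j)$ analytically: with prediction errors $E_i = f(x_i) - y_i$ and curvature $\eta = K_{ii} + K_{jj} - 2K_{ij}$, the unclipped update is

\begin{equation}
\alpha_j^{\text{new}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta},
\end{equation}

after which $\alpha_j^{\text{new}}$ is clipped to its box constraint and $\alpha_i$ is adjusted to preserve $\sum_i \alpha_i y_i = 0$.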
In the last decade, GPU computation power has been widely utilized by machine learning applications. Because