Multi-objective learning of Relevance Vector Machine classifiers with multi-resolution kernels

Andrew R.J. Clark, Richard M. Everson*
Department of Computer Science, College of Engineering, Mathematics and Physical Sciences, University of Exeter, EX4 4QF, UK

Article history: Received 17 February 2011; Received in revised form 31 January 2012; Accepted 19 February 2012; Available online 7 March 2012

Keywords: Relevance Vector Machine; Evolutionary algorithm; Classification; Multi-resolution kernels; Cross-validation

Abstract

The Relevance Vector Machine (RVM) is a sparse classifier in which complexity is controlled with the Automatic Relevance Determination prior. However, sparsity is dependent on kernel choice and severe over-fitting can occur. We describe multi-objective evolutionary algorithms (MOEAs) which optimise RVMs, allowing selection of the best operating true and false positive rates and complexity from the Pareto set of optimal trade-offs. We introduce several cross-validation methods for use during evolutionary optimisation. Comparisons on benchmark datasets using multi-resolution kernels show that the MOEAs can locate markedly sparser RVMs than the standard, with comparable accuracies.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

The Relevance Vector Machine (RVM) [1,2] and its faster implementation, the fast RVM (fRVM) [3,4], produce sparse probabilistic models for pattern recognition problems. As with the Support Vector Machine (SVM), use of the kernel trick [5] allows models to be built in high-dimensional feature spaces at low computational cost, but with the advantage of a probabilistic formulation. Through use of the Automatic Relevance Determination (ARD) prior [6], outlined in Section 2, the RVM 'switches off' basis functions for which there is little or no support in the data, thus producing a sparse representation. However, sparsity is controlled not only by the ARD prior, but also through the choice of kernel [7].
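To make the kernel-choice point concrete, the following sketch (our own illustration with our own naming, not the authors' code) builds a multi-resolution design matrix by pooling Gaussian (RBF) basis functions at several widths; the many correlated columns this produces are exactly what the ARD prior must prune, and why over-fitting can arise:

```python
import numpy as np

def multires_design_matrix(X, centres, widths):
    """Illustrative multi-resolution kernel design matrix: RBF basis
    functions centred on `centres`, repeated at each width in `widths`.
    Columns grow as len(centres) * len(widths)."""
    # Squared Euclidean distances between every input and every centre
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    # One block of RBF responses per resolution (kernel width)
    blocks = [np.exp(-d2 / (2.0 * s ** 2)) for s in widths]
    return np.hstack(blocks)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
# Centres placed on the training inputs, three resolutions
Phi = multires_design_matrix(X, X, widths=[0.5, 1.0, 2.0])
print(Phi.shape)  # (10, 30): 10 centres x 3 resolutions
```

With centres on the training inputs themselves, each resolution contributes one block of columns, so a classifier with free run of all blocks can fit noise at the finest width unless complexity is penalised.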
In particular, severe over-fitting occurs when multi-resolution kernels are employed. For regression problems this has been dealt with through the use of a smoothness prior [7], but that methodology does not easily carry over to classification. Additionally, it is often unclear what the costs of misclassification are, and commonly one wants to assess performance over a range of misclassification costs rather than a single one, typically through the use of the Receiver Operating Characteristic (ROC) curve (e.g., [8]).

To address these problems we propose a multi-objective optimisation method using an evolutionary algorithm in which we simultaneously optimise not only an RVM's true positive rate, T, and false positive rate, F, but also a new measure of the model complexity, C. By simultaneously optimising T, F and C we generate an approximation to the Pareto front containing the best trade-offs between true positive rate, false positive rate and model complexity: at any fixed complexity the Pareto front may be regarded as the ROC curve optimised over RVMs of that complexity. Through this control of the model complexity we reduce over-fitting and so produce sparser models with equivalent generalisation performance. In common with most training procedures, multi-objective evolutionary algorithms (MOEAs) are prone to over-fitting on a training set; we therefore examine schemes to control over-fitting during the evolutionary optimisation process and look at their effects on test error rates.

We begin by outlining the theoretical basis of the RVM and multi-objective optimisation, before presenting, in Section 3, our multi-objective optimising evolutionary algorithm and providing a comparison of the fRVM and the MOEA. We then describe in Section 4 the different cross-validation schemes investigated and present results in Section 5 on a number of benchmark binary classification tasks in terms of accuracy and area under the ROC curve.
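The Pareto-optimality relation over (T, F, C) described above can be sketched as follows; this is an illustrative helper under our own naming, not code from the paper. A solution dominates another if it is no worse in all three objectives and strictly better in at least one (maximising T, minimising F and C):

```python
def dominates(a, b):
    """True if solution `a` Pareto-dominates `b`.
    Each solution is (T, F, C): maximise the true positive rate T,
    minimise the false positive rate F and the complexity C."""
    ta, fa, ca = a
    tb, fb, cb = b
    no_worse = ta >= tb and fa <= fb and ca <= cb
    strictly_better = ta > tb or fa < fb or ca < cb
    return no_worse and strictly_better

def pareto_front(solutions):
    """Non-dominated subset of `solutions` (the approximation set)."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Toy RVM evaluations: (true positive rate, false positive rate, complexity)
pop = [(0.90, 0.10, 12), (0.85, 0.05, 8), (0.80, 0.20, 15), (0.90, 0.10, 10)]
print(pareto_front(pop))  # -> [(0.85, 0.05, 8), (0.9, 0.1, 10)]
```

In the toy population, (0.90, 0.10, 12) is dominated by (0.90, 0.10, 10), which matches it on T and F but is sparser; slicing the surviving set at a fixed C recovers a per-complexity ROC operating point, as described above.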
2. Relevance Vector Machines

The Relevance Vector Machine models binary classification using a logistic link function to map a linear combination of basis functions to a posterior class probability; thus if x is an input

Pattern Recognition 45 (2012) 3535–3543. doi:10.1016/j.patcog.2012.02.025
* Corresponding author. Tel.: +44 1392 724065.
E-mail addresses: Andrew.Clark@exeter.ac.uk (A.R.J. Clark), R.M.Everson@exeter.ac.uk (R.M. Everson).
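The opening sentence of Section 2 is truncated in this copy. The standard RVM classification model it introduces, reconstructed here for reference from the usual formulation (after Tipping [1]) rather than taken verbatim from this paper, is:

```latex
% Standard RVM binary classifier: logistic link applied to a linear
% combination of kernel basis functions on the N training inputs x_n.
p(t = 1 \mid \mathbf{x}) = \sigma\bigl(y(\mathbf{x})\bigr),
\qquad
y(\mathbf{x}) = \sum_{n=1}^{N} w_n K(\mathbf{x}, \mathbf{x}_n) + w_0,
\qquad
\sigma(a) = \frac{1}{1 + e^{-a}}
```

Under the ARD prior, each weight w_n has its own precision hyperparameter; weights whose precisions grow unboundedly are pruned, leaving only the "relevance vectors".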