Analytica Chimica Acta 664 (2010) 27–33

Multi-class classification with probabilistic discriminant partial least squares (p-DPLS)

Néstor F. Pérez, Joan Ferré, Ricard Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University, C/ Marcel·lí Domingo s/n, 43007 Tarragona, Spain

Article history: Received 19 October 2009; received in revised form 22 January 2010; accepted 29 January 2010; available online 6 February 2010.

Keywords: Reliability; Multi-class classification; Discriminant partial least squares; Probabilistic DPLS

Abstract

This work describes multi-class classification based on binary probabilistic discriminant partial least squares (p-DPLS) models, developed with the one-against-one strategy and the winner-takes-all principle. The multi-class classification problem is split into binary classification problems solved with p-DPLS models, and the results of these models are combined to obtain the final classification result. The classification criterion uses the specific characteristics of an object (its position in the multivariate space and its prediction uncertainty) to estimate the reliability of the classification, so that the object is assigned to the class with the highest reliability. This new methodology is tested with the well-known Iris data set and a data set of Italian olive oils. When compared with CART and SIMCA, the proposed method has better average classification performance, besides providing a statistic that evaluates the reliability of each classification. For the olive oil set, the average percentage of correct classification for the training set was close to 84% with p-DPLS, against 75% with CART and 100% with SIMCA, while for the test set the average was close to 94% with p-DPLS, as against 50% with CART and 62% with SIMCA.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

In multi-class classification problems we have an I × J matrix X of J variables observed on I training objects, a vector y that encodes the class c (c = 1, ..., C; with C > 2) of each object, and a vector x of variables measured for an unknown object that must be assigned to one (or none) of the C possible classes. Examples of multi-class classification problems are the assignment of food commodities to one out of several possible origins [1,2] and the identification of different tumour types from microarray gene expression data [3,4].

A multi-class classification problem is solved by using an adequate classifier decision function that maps x onto a class label [5]. One approach is to use a single classification function, as in k-Nearest Neighbours (k-NN) [6] or Artificial Neural Networks (ANN) [7]; in these cases, the object is assigned to one of the C classes in a single step. Another approach is to divide the multi-class problem into K smaller classification problems, each with its own decision rule, and then combine the outputs of the K individual classifications to obtain the final result. This can be done not only by using single-class models, such as the Soft Independent Modelling of Class Analogy (SIMCA) method [1], but also by using binary classification methods that decide between two classes or super-classes. The latter approach is known as dichotomization or binarization [1] and has the advantage that a wide range of binary classification methods, such as Support Vector Machines [5] and discriminant partial least squares (DPLS) [8], can be used. There are three possible ways of splitting the classes for binary classifiers: one-against-all (where "all" means "the rest"), one-against-one and P-against-Q [9].
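As an illustration of the one-against-one decomposition with winner-takes-all combination described above, the following is a minimal sketch (hypothetical names throughout, not the paper's p-DPLS): each pair of classes gets its own binary model, here a simple nearest-centroid rule as a stand-in binary classifier, and each pairwise model casts a vote for the final class.

```python
# Sketch of one-against-one binarization with winner-takes-all voting.
# A nearest-centroid rule stands in for the binary classifier; with C
# classes, C(C-1)/2 pairwise models are trained.
from itertools import combinations

import numpy as np


def train_pairwise(X, y):
    """Fit one binary (centroid-based) model per pair of classes."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        # Each pairwise model only sees the objects of its two classes.
        models[(a, b)] = (X[y == a].mean(axis=0), X[y == b].mean(axis=0))
    return models


def predict_winner_takes_all(models, x):
    """Each pairwise model votes; the class with most votes wins."""
    votes = {}
    for (a, b), (ca, cb) in models.items():
        winner = a if np.linalg.norm(x - ca) <= np.linalg.norm(x - cb) else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```

Note the contrast with the hierarchical strategies discussed next: here all C(C-1)/2 models are evaluated for every object and their outputs are combined simultaneously, so no single model's error is automatically fatal.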
In all cases the original vector y is replaced by another one that encodes with a "1" the objects that belong to the class or classes of interest, and with a "0" the objects that do not. The P-against-Q (PAQ) strategy first splits the data into two groups, one with P classes and one with the remaining Q classes. At the next level, the classes in P are again split into two groups, and a binary model that discriminates between them is calculated; the same division is applied to the classes in Q. The splitting continues at successive levels until the models discriminate between only two classes. This procedure can solve a classification problem of C classes using C-1 binary models [6]. The drawback of hierarchical PAQ is that an allocation error at one node results in the object being misclassified. The one-against-all (OAA) strategy is similar to PAQ, except that P contains only one class and Q contains the remaining C-1 classes. In this case, the problem is solved either hierarchically (which involves C-1 models) or by simultaneous combination of C binary models [5]. A weakness of OAA is that the number of objects of the class of interest can be imbalanced with respect to the other super-class, which contains the rest of the objects. Moreover, incompatible classes (which could be discriminated correctly if each were modelled against the others) are grouped together, forcing the model to treat opposite classes as a single super-class. Finally, the strategy one-

0003-2670/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2010.01.059