Analytica Chimica Acta 664 (2010) 27–33

Multi-class classification with probabilistic discriminant partial least squares (p-DPLS)

Néstor F. Pérez, Joan Ferré, Ricard Boqué
Department of Analytical Chemistry and Organic Chemistry, Rovira i Virgili University, C/ Marcel·lí Domingo s/n, 43007 Tarragona, Spain

Article history: Received 19 October 2009; received in revised form 22 January 2010; accepted 29 January 2010; available online 6 February 2010.

Keywords: Reliability; Multi-class classification; Discriminant partial least squares; Probabilistic DPLS

Abstract

This work describes multi-class classification based on binary probabilistic discriminant partial least squares (p-DPLS) models, developed with the one-against-one strategy and the winner-takes-all principle. The multi-class classification problem is split into binary classification problems solved with p-DPLS models, and the results of these models are combined to obtain the final classification result. The classification criterion uses the specific characteristics of an object (its position in the multivariate space and its prediction uncertainty) to estimate the reliability of the classification, so that the object is assigned to the class with the highest reliability. This new methodology is tested with the well-known Iris data set and a data set of Italian olive oils. When compared with CART and SIMCA, the proposed method has better average classification performance, besides providing a statistic that evaluates the reliability of each classification. For the olive oil set, the average percentage of correct classification for the training set was close to 84% with p-DPLS, against 75% with CART and 100% with SIMCA, while for the test set the average was close to 94% with p-DPLS, as against 50% with CART and 62% with SIMCA.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

In multi-class classification problems we have an I × J matrix X of J variables observed on I training objects, a vector y that encodes the class c (c = 1, ..., C; with C > 2) of each object, and a vector x of variables measured for an unknown object that must be assigned to one (or none) of the C possible classes. Examples of multi-class classification problems are the assignment of food commodities to one out of several possible origins [1,2] and the identification of different tumour types from microarray gene expression data [3,4].

A multi-class classification problem is solved by using an adequate classifier decision function that maps x onto a class label [5]. One approach is to use a single classification function, as in k-Nearest Neighbours (k-NN) [6] or Artificial Neural Networks (ANN) [7]; in these cases, the object is assigned to one of the C classes in a single step. Another approach is to divide the multi-class problem into K smaller classification problems, each with its own decision rule, and then combine the outputs of the K individual classifications to obtain the final result. This can be done not only by using single-class models, such as the Soft Independent Modelling of Class Analogy (SIMCA) method [1], but also by using binary classification methods that decide between two classes or super-classes. The latter approach is known as dichotomization or binarization [1] and has the advantage that a wide range of binary classification methods, such as Support Vector Machines [5] and discriminant partial least squares (DPLS) [8], can be used. There are three possible ways of splitting the classes for binary classifiers: one-against-all (where "all" means "the rest"), one-against-one and P-against-Q [9].
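As an illustration of the one-against-one decomposition with winner-takes-all combination described above, the following is a minimal sketch (hypothetical names throughout, not the paper's p-DPLS): each pair of classes gets its own binary model, here a simple nearest-centroid rule as a stand-in binary classifier, and each pairwise model casts a vote for the final class.

```python
# Sketch of one-against-one binarization with winner-takes-all voting.
# A nearest-centroid rule stands in for the binary classifier; with C
# classes, C(C-1)/2 pairwise models are trained.
from itertools import combinations

import numpy as np


def train_pairwise(X, y):
    """Fit one binary (centroid-based) model per pair of classes."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        # Each pairwise model only sees the objects of its two classes.
        models[(a, b)] = (X[y == a].mean(axis=0), X[y == b].mean(axis=0))
    return models


def predict_winner_takes_all(models, x):
    """Each pairwise model votes; the class with most votes wins."""
    votes = {}
    for (a, b), (ca, cb) in models.items():
        winner = a if np.linalg.norm(x - ca) <= np.linalg.norm(x - cb) else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```

Note the contrast with the hierarchical strategies discussed next: here all C(C-1)/2 models are evaluated for every object and their outputs are combined simultaneously, so no single model's error is automatically fatal.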
In all cases the original vector y is replaced by another one that encodes with a "1" the objects that belong to the class or classes of interest, and with a "0" the objects that do not. The P-against-Q (PAQ) strategy first splits the data into two groups, one with P classes and one with the remaining Q classes. At the next level, the classes in P are again split into two groups, and a binary model that discriminates between them is calculated; the same division is applied to the classes in Q. The splitting continues at successive levels until the models discriminate between only two classes. This procedure can solve a classification problem of C classes using C-1 binary models [6]. The drawback of hierarchical PAQ is that an allocation error at one node results in the object being misclassified. The one-against-all (OAA) strategy is similar to PAQ, except that P contains only one class and Q contains the remaining C-1 classes. In this case, the problem is solved either hierarchically (which involves C-1 models) or by simultaneous combination of C binary models [5]. A weakness of OAA is that the number of objects of the class of interest can be imbalanced with respect to the other super-class, which contains the rest of the objects. Moreover, incompatible classes (which could be discriminated correctly if each were modelled against the others) are grouped together, forcing the model to treat opposite classes as a single super-class. Finally, the strategy one-

0003-2670/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2010.01.059