© Springer-Verlag

Probabilistic Discriminative Kernel Classifiers for Multi-class Problems

Volker Roth
University of Bonn, Department of Computer Science III
Roemerstr. 164, D-53117 Bonn, Germany
roth@cs.uni-bonn.de

Abstract. Logistic regression is presumably the most popular representative of probabilistic discriminative classifiers. In this paper, a kernel variant of logistic regression is introduced as an iteratively re-weighted least-squares algorithm in kernel-induced feature spaces. This formulation allows us to apply highly efficient approximation methods that are capable of dealing with large-scale problems. For multi-class problems, a pairwise coupling procedure is proposed. Pairwise coupling for "kernelized" logistic regression effectively overcomes conceptual and numerical problems of standard multi-class kernel classifiers.

1 Introduction

Classifiers can be partitioned into two main groups, namely informative and discriminative ones. In the informative approach, the classes are described by modeling their structure, i.e. their generative statistical model. Starting from these class models, the posterior distribution of the labels is derived via the Bayes formula. The most popular method of the informative kind is classical Linear Discriminant Analysis (LDA). However, the informative approach has a clear disadvantage: modeling the classes is usually a much harder problem than solving the classification problem directly.

In contrast to the informative approach, discriminative classifiers focus on modeling the decision boundaries or the class probabilities directly. No attempt is made to model the underlying class densities. In general, they are more robust than informative ones, since fewer assumptions are made about the classes. The most popular discriminative method is logistic regression (LOGREG) [1]. The aim of logistic regression is to produce an estimate of the posterior probability of membership in each of the c classes.
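For the two-class case, this posterior estimate takes the familiar logistic form, and the maximum-likelihood fit leads to the iteratively re-weighted least-squares (IRLS) updates referred to in the abstract. The following is a standard textbook sketch of the linear case (the kernel variant developed in this paper replaces the input vectors x by kernel-induced feature representations):

```latex
% Logistic posterior model for the two-class case, y \in \{0,1\}:
P(y = 1 \mid \mathbf{x}) \;=\; \sigma(\mathbf{w}^\top \mathbf{x})
  \;=\; \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}.

% Newton-Raphson maximization of the log-likelihood yields the
% iteratively re-weighted least-squares (IRLS) update:
\mathbf{w}^{\text{new}}
  \;=\; (X^\top W X)^{-1} X^\top W \mathbf{z},
\qquad
\mathbf{z} \;=\; X\mathbf{w} + W^{-1}(\mathbf{y} - \mathbf{p}),

% where p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i) and
% W = \mathrm{diag}\bigl(p_i (1 - p_i)\bigr) is the weight matrix,
% so each Newton step is a weighted least-squares fit to the
% working response z.
```

Each iteration is thus a weighted least-squares problem, which is what makes efficient least-squares approximation techniques applicable once the model is kernelized.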
Thus, besides predicting class labels, LOGREG additionally provides a probabilistic confidence measure about this labeling. This allows us to adapt to varying class priors.

A different approach to discriminative classification is given by the Support Vector (SV) method. Within a maximum entropy framework, it can be viewed as the discriminative model that makes the least assumptions about the estimated model parameters,