Neural pairwise classification models created by ignoring irrelevant alternatives

Ondrej Šuch 1,2, Martin Kontšek 2, and Andrea Tinajová 1

1 Mathematical Institute, Slovak Academy of Sciences, 955 01 Banská Bystrica, Slovakia, ondrejs@savbb.sk
2 Žilinská Univerzita v Žiline, Univerzitná 8215/1, 010 26 Žilina

Abstract: It is possible to construct multiclass classification models from binary classifiers trained in a pairwise (one-on-one) manner. Important examples of models created in this way are support vector machines applied to multiclass problems. In this work we examine the feasibility of this approach for convolutional neural networks. We examine multiple ways to train pairwise classification networks for the MNIST dataset, and multiple ways to combine them into a multiclass classifier for the MNIST classification problem. Our experimental results show definite promise for this approach, especially in reducing the complexity of deep neural networks. An important unresolved question of our approach is how to choose the best pairwise network to include in the full multi-class model.

Keywords: MNIST, convolutional network, pairwise coupling, one-on-one classification, binary classification, dropout

1 Introduction

Deep neural networks are currently the most powerful type of classifier applicable to a multitude of machine learning problems [1]. Perhaps their biggest drawback is their complexity, which manifests in multiple ways. First, they require the preparation and use of large datasets to attain the best precision [2]. Second, they need a lot of specialized computing power for training [3], and the training process may take a long time. Finally, the classification process is obscured by their complexity, which makes it harder to understand their weaknesses and to guarantee performance on unseen instances.
In this article we consider the question of whether the classification process using deep neural networks could be made more modular, alleviating the drawbacks resulting from their complexity.

The approach is inspired by research on support vector machines. Support vector machines were proposed by Vapnik as a general purpose classifier [4]. They remain popular to this date for a variety of classification tasks. Since SVMs work by dividing the feature space (or its embedding into a higher-dimensional space [5]) into two parts by a hyperplane, they are naturally suited for two-class problems. Multi-class classification with SVMs is accomplished by training an SVM for each pair of classes, then fitting a sigmoid to obtain pairwise probabilities [6], and finally using a pairwise coupling method to obtain multiclass prediction probabilities [7].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The same approach can be applied to deep neural networks. An additional simplification compared to SVMs is that the step of fitting a sigmoid is not necessary, since neural networks typically use a soft-max final layer that directly outputs prediction probabilities. On the other hand, compared to SVMs, the training process of neural networks involves many more hyperparameters.

In this paper we will carry out the process on the MNIST digit classification task [8], illustrating the potential and possible pitfalls of the approach (Figure 1). Even for the basic MNIST task we had to severely limit the number of investigated training procedures. A key restriction we adopted is that we trained the two-class networks only with examples belonging to the corresponding two classes.
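The pairwise coupling step mentioned above can be illustrated with a short sketch in the spirit of the iterative algorithm of Hastie and Tibshirani [9]: given pairwise probabilities r[i][j] ≈ P(class i | class i or class j) produced by the two-class models, it finds multiclass probabilities p whose implied pairwise ratios p_i / (p_i + p_j) match the inputs. This is only an illustrative sketch, not the exact procedure used in our experiments; the function name, the equal weighting of all class pairs, and the toy input are ours.

```python
# Sketch of pairwise coupling (Hastie & Tibshirani style); all class
# pairs are weighted equally here, which is an assumption of this sketch.

def pairwise_coupling(r, n_iter=1000, tol=1e-10):
    """Combine pairwise probabilities r[i][j] ~ P(class i | i or j),
    with r[i][j] + r[j][i] == 1, into multiclass probabilities p."""
    K = len(r)
    p = [1.0 / K] * K
    for _ in range(n_iter):
        delta = 0.0
        for i in range(K):
            # Scale p[i] by the ratio of the observed pairwise
            # probabilities to the ones implied by the current p.
            num = sum(r[i][j] for j in range(K) if j != i)
            den = sum(p[i] / (p[i] + p[j]) for j in range(K) if j != i)
            new_pi = p[i] * num / den
            delta = max(delta, abs(new_pi - p[i]))
            p[i] = new_pi
        s = sum(p)                      # renormalize to a probability vector
        p = [pi / s for pi in p]
        if delta < tol:                 # stop once the updates stabilize
            break
    return p

# Toy three-class example with self-consistent pairwise inputs: the
# coupling should recover the underlying probability vector true_p.
true_p = [0.5, 0.3, 0.2]
r = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        if i != j:
            r[i][j] = true_p[i] / (true_p[i] + true_p[j])

p = pairwise_coupling(r)
```

For the ten-class MNIST problem this coupling step would combine the outputs of all 10 × 9 / 2 = 45 pairwise networks into a single probability vector over the digits, where each pairwise network has only ever been trained on its own two classes.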
Such an approach, dubbed ‘ignoring irrelevant alternatives’, promises to speed up the training process for the two-class networks by reducing the size of the training dataset, as well as to cut the time needed to train the whole multi-class model. Moreover, it is philosophically consistent with the assumption of the independence of irrelevant alternatives in the softmax layer commonly used in neural networks.

Our restriction, and the pairwise decomposition of multi-class classification itself, are not without potential problems. A key issue is that of extrapolating prediction probabilities to classes that a two-class classifier has never seen during training (e.g. the insight of G. Hinton described in the work of Hastie and Tibshirani [9]). Another potential problem to guard against is that the proposed classification scheme may require many more parameters, since as many as 10 × 9 / 2 = 45 neural networks need to be trained instead of one.

2 Methodology outline

The MNIST dataset of handwritten digits (zero to nine) is a widely used benchmark task on which convolutional networks have proved quite successful [8]. It consists of 60000 training samples and 10000 testing samples. Throughout this work we will use 8-layer feed-forward networks de-