Neural pairwise classification models created by ignoring irrelevant alternatives

Ondrej Šuch 1,2, Martin Kontšek 2, and Andrea Tinajová 1

1 Mathematical Institute, Slovak Academy of Sciences, 955 01 Banská Bystrica, Slovakia, ondrejs@savbb.sk
2 Žilinská Univerzita v Žiline, Univerzitná 8215/1, 010 26 Žilina

Abstract: It is possible to construct multiclass classification models from binary classifiers trained in a pairwise (one-on-one) manner. Important examples of models created in this way are support vector machines applied to multiclass problems. In this work we examine the feasibility of this approach for convolutional neural networks. We examine multiple ways to train pairwise classification networks for the MNIST dataset, and multiple ways to combine them into a multiclass classifier for the MNIST classification problem. Our experimental results show definite promise for this approach, especially in reducing the complexity of deep neural networks. An important unresolved question of our approach is how to choose the best pairwise network to include in the full multi-class model.

Keywords: MNIST, convolutional network, pairwise coupling, one-on-one classification, binary classification, dropout

1 Introduction

Deep neural networks are currently the most powerful type of classifier applicable to a multitude of machine learning problems [1]. Perhaps their biggest drawback is their complexity, which manifests in multiple ways. First, they require the preparation and use of large datasets to attain the best precision [2]. Second, they need a lot of specialized computing power for training [3], and the training process may take a long time. Finally, the classification process is obscured by their complexity, which makes it harder to understand their weaknesses and to guarantee performance on unseen instances.
In this article we consider the question of whether the classification process using deep neural networks could be made more modular, alleviating the drawbacks resulting from their complexity.

The approach is inspired by research on support vector machines. Support vector machines were proposed by Vapnik as a general purpose classifier [4]. They remain popular to this date for a variety of classification tasks. Since SVMs work by dividing the feature space (or its embedding into a higher-dimensional space [5]) into two parts by a hyperplane, they are naturally suited for two-class problems. Multi-class classification with SVMs is accomplished by training an SVM for each pair of classes, then fitting a sigmoid to obtain pairwise probabilities [6], and finally using a pairwise coupling method to obtain multiclass prediction probabilities [7].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The same approach can be applied to deep neural networks. An additional simplification compared to SVMs is that the step of fitting a sigmoid is not necessary, since neural networks typically use a soft-max final layer that directly outputs prediction probabilities. On the other hand, compared to SVMs, the training process of neural networks involves many more hyperparameters.

In this paper we will carry out the process on the MNIST digit classification task [8], illustrating the potential and possible pitfalls of the approach (Figure 1). Even for the basic MNIST task we had to severely limit the number of investigated training procedures. A key restriction we adopted is that we trained the two-class networks only with examples belonging to the corresponding two classes.
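The pairwise coupling step mentioned above can be illustrated with a short sketch in the spirit of the iterative algorithm of Hastie and Tibshirani [9]: given pairwise probabilities r[i][j] ≈ P(class i | class i or class j) produced by the two-class models, it finds multiclass probabilities p whose implied pairwise ratios p_i / (p_i + p_j) match the inputs. This is only an illustrative sketch, not the exact procedure used in our experiments; the function name, the equal weighting of all class pairs, and the toy input are ours.

```python
# Sketch of pairwise coupling (Hastie & Tibshirani style); all class
# pairs are weighted equally here, which is an assumption of this sketch.

def pairwise_coupling(r, n_iter=1000, tol=1e-10):
    """Combine pairwise probabilities r[i][j] ~ P(class i | i or j),
    with r[i][j] + r[j][i] == 1, into multiclass probabilities p."""
    K = len(r)
    p = [1.0 / K] * K
    for _ in range(n_iter):
        delta = 0.0
        for i in range(K):
            # Scale p[i] by the ratio of the observed pairwise
            # probabilities to the ones implied by the current p.
            num = sum(r[i][j] for j in range(K) if j != i)
            den = sum(p[i] / (p[i] + p[j]) for j in range(K) if j != i)
            new_pi = p[i] * num / den
            delta = max(delta, abs(new_pi - p[i]))
            p[i] = new_pi
        s = sum(p)                      # renormalize to a probability vector
        p = [pi / s for pi in p]
        if delta < tol:                 # stop once the updates stabilize
            break
    return p

# Toy three-class example with self-consistent pairwise inputs: the
# coupling should recover the underlying probability vector true_p.
true_p = [0.5, 0.3, 0.2]
r = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        if i != j:
            r[i][j] = true_p[i] / (true_p[i] + true_p[j])

p = pairwise_coupling(r)
```

For the ten-class MNIST problem this coupling step would combine the outputs of all 10 × 9 / 2 = 45 pairwise networks into a single probability vector over the digits, where each pairwise network has only ever been trained on its own two classes.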
Such an approach, dubbed ‘ignoring irrelevant alternatives’, promises to speed up the training process for the two-class networks by reducing the size of the training dataset, as well as to cut the time needed to train the whole multi-class model. Moreover, it is philosophically consistent with the assumption of the independence of irrelevant alternatives in the softmax layer commonly used in neural networks.

Our restriction, and the pairwise decomposition of multi-class classification itself, are not without potential problems. A key issue is that of extrapolating prediction probabilities to classes that a two-class classifier has never seen during training (e.g. the insight of G. Hinton described in the work of Hastie and Tibshirani [9]). Another potential problem to guard against is that the proposed classification scheme may require many more parameters, since as many as 10 × 9 / 2 = 45 neural networks need to be trained instead of one.

2 Methodology outline

The MNIST dataset of handwritten digits (zero to nine) is a widely used benchmark task on which convolutional networks have proved quite successful [8]. It consists of 60000 training samples and 10000 testing samples. Throughout this work we will use 8-layer feed-forward networks de-