Intelligent Data Analysis 22 (2018) 1115–1126
DOI 10.3233/IDA-173536
IOS Press

Finite population Bayesian bootstrapping in high-dimensional classification via logistic regression

Shaho Zarei, Adel Mohammadpour* and Saeid Rezakhah
Department of Statistics, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

Abstract. When the sample size is equal to or smaller than the number of covariates, traditional logistic regression is plagued by degeneracy and wild behavior, so its classification results are not reliable. We use finite population Bayesian bootstrapping for resampling, so that the new sample size becomes greater than the number of covariates. By combining the original samples with the mean of the simulated data, and applying a sufficient dimension reduction method, we introduce a new algorithm based on traditional logistic regression for high-dimensional binary classification. We then compare the proposed algorithm with regularized logistic models and other popular classification algorithms using both simulated and real data.

Keywords: Finite population Bayesian bootstrapping, logistic regression classifier, high-dimensional data classification, sliced inverse regression

1. Introduction

Classification is one of the most important methods in multivariate statistical analysis and supervised learning. The aim of classification is to assign classes to new data using a suitable classifier, which is learned from data with known labels. In many scientific areas, such as biology and medicine, we face High-Dimensional Data (HDD), i.e., data in which the number of variables is often larger than the sample size. In statistical problems, a large number of variables causes difficulties in fitting the model, estimating parameters, optimizing the objective functions, and carrying out numerical analysis. These phenomena are referred to as the curse of dimensionality [3].
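The resampling idea in the abstract, enlarging a small observed sample until it exceeds the number of covariates, can be illustrated with a simple Polya-urn scheme in the spirit of the Bayesian bootstrap. This is only a hedged sketch for intuition, not the authors' exact finite population Bayesian bootstrapping algorithm; the function name `polya_urn_resample` and all parameter choices are hypothetical.

```python
import numpy as np

def polya_urn_resample(sample, N, rng):
    """Draw N values from a Polya urn initialised with `sample`.

    Each draw is returned to the urn together with an extra copy,
    so early draws are reinforced -- a Bayesian-bootstrap-style
    resampling of the observed values. (Illustrative sketch only.)
    """
    urn = list(sample)
    out = []
    for _ in range(N):
        draw = urn[rng.integers(len(urn))]  # uniform draw from current urn
        urn.append(draw)                    # reinforce the drawn value
        out.append(draw)
    return np.array(out)

rng = np.random.default_rng(1)
original = rng.normal(size=10)                       # n = 10 observations
expanded = polya_urn_resample(original, N=200, rng=rng)
print(expanded.shape)                                # (200,): new sample larger than n
```

Every resampled value is one of the original observations, so the expanded sample only reweights the observed data; the point is that the enlarged sample size can be made greater than the number of covariates.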
In these situations, traditional classifiers such as logistic regression, despite their good accuracy on low-dimensional data, are not usable. For example, in such a case logistic regression is plagued by degeneracy and wild behavior; that is, its classification results are not reliable. Furthermore, other well-known classifiers such as Naive Bayes (NB) and K-Nearest Neighbors (KNN) [8] rely on restrictive assumptions for the classification of HDD. In NB, we must compute the posterior distribution of the response variable given the covariates; however, due to the curse of dimensionality and noise accumulation, we need to accept the restrictive assumption of conditional independence of the covariates. KNN, although simple to learn, suffers from overfitting in HDD classification, especially when the sample size is very small.

*Corresponding author: Adel Mohammadpour, Department of Statistics, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran. Tel.: +98 21 64542533; E-mail: adel@aut.ac.ir.

1088-467X/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved
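The degeneracy of logistic regression when the sample size does not exceed the number of covariates has a simple linear-algebra illustration: with n observations and p > n covariates, the design matrix cannot have full column rank, so the Gram matrix appearing in the Newton updates of the maximum likelihood fit is singular and the MLE is not identifiable. A minimal sketch, assuming a Gaussian design with illustrative sizes n = 20 and p = 100:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # sample size smaller than covariate count
X = rng.normal(size=(n, p))          # n x p design matrix

rank = np.linalg.matrix_rank(X)
print(rank)                          # 20: at most n, far below p

XtX = X.T @ X                        # p x p Gram matrix used in Newton steps
print(np.linalg.matrix_rank(XtX))    # 20: singular, so no unique MLE exists
```

Because X'X has rank at most n < p, infinitely many coefficient vectors fit the training labels equally well, which is one source of the unreliable classification results described above.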