Intelligent Data Analysis 22 (2018) 1115–1126
DOI 10.3233/IDA-173536
IOS Press

Finite population Bayesian bootstrapping in high-dimensional classification via logistic regression

Shaho Zarei, Adel Mohammadpour* and Saeid Rezakhah
Department of Statistics, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran

Abstract. When the sample size is equal to or smaller than the number of covariates, traditional logistic regression is plagued by degeneracy and wild behavior, so its classification results are not reliable. We use finite population Bayesian bootstrapping for resampling, so that the new sample size becomes greater than the number of covariates. By combining the original samples with the mean of the simulated data, and applying a sufficient dimension reduction method, we introduce a new algorithm based on traditional logistic regression for high-dimensional binary classification. We then compare the proposed algorithm with regularized logistic models and other popular classification algorithms using both simulated and real data.

Keywords: Finite population Bayesian bootstrapping, logistic regression classifier, high-dimensional data classification, sliced inverse regression

1. Introduction

Classification is one of the most important methods in multivariate statistical analysis and supervised learning. The aim of classification is to assign classes to new data using a suitable classifier, which is learned from data with known labels. In many scientific areas, such as biology and medicine, we face High-Dimensional Data (HDD), i.e., data in which the number of variables is often larger than the sample size. In statistical problems, a large number of variables causes difficulties in fitting the model, estimating parameters, optimizing the objective functions, and carrying out numerical analysis. These phenomena are referred to as the curse of dimensionality [3].
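The resampling idea in the abstract, enlarging a small observed sample until it exceeds the number of covariates, can be illustrated with a simple Polya-urn scheme in the spirit of the Bayesian bootstrap. This is only a hedged sketch for intuition, not the authors' exact finite population Bayesian bootstrapping algorithm; the function name `polya_urn_resample` and all parameter choices are hypothetical.

```python
import numpy as np

def polya_urn_resample(sample, N, rng):
    """Draw N values from a Polya urn initialised with `sample`.

    Each draw is returned to the urn together with an extra copy,
    so early draws are reinforced -- a Bayesian-bootstrap-style
    resampling of the observed values. (Illustrative sketch only.)
    """
    urn = list(sample)
    out = []
    for _ in range(N):
        draw = urn[rng.integers(len(urn))]  # uniform draw from current urn
        urn.append(draw)                    # reinforce the drawn value
        out.append(draw)
    return np.array(out)

rng = np.random.default_rng(1)
original = rng.normal(size=10)                       # n = 10 observations
expanded = polya_urn_resample(original, N=200, rng=rng)
print(expanded.shape)                                # (200,): new sample larger than n
```

Every resampled value is one of the original observations, so the expanded sample only reweights the observed data; the point is that the enlarged sample size can be made greater than the number of covariates.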
In these situations, traditional classifiers such as logistic regression, despite their good accuracy on low-dimensional data, are not usable. For example, in such a case logistic regression is plagued by degeneracy and wild behavior; that is, its classification results are not reliable. Furthermore, other well-known classifiers such as Naive Bayes (NB) and K-Nearest Neighbors (KNN) [8] rely on restrictive assumptions for the classification of HDD. In NB, we must compute the posterior distribution of the response variable given the covariates; however, due to the curse of dimensionality and noise accumulation, we need to accept the restrictive assumption of conditional independence of the covariates. KNN, although simple to learn, suffers from overfitting in HDD classification, especially when the sample size is very small.

*Corresponding author: Adel Mohammadpour, Department of Statistics, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran. Tel.: +98 21 64542533; E-mail: adel@aut.ac.ir.

1088-467X/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved
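The degeneracy of logistic regression when the sample size does not exceed the number of covariates has a simple linear-algebra illustration: with n observations and p > n covariates, the design matrix cannot have full column rank, so the Gram matrix appearing in the Newton updates of the maximum likelihood fit is singular and the MLE is not identifiable. A minimal sketch, assuming a Gaussian design with illustrative sizes n = 20 and p = 100:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # sample size smaller than covariate count
X = rng.normal(size=(n, p))          # n x p design matrix

rank = np.linalg.matrix_rank(X)
print(rank)                          # 20: at most n, far below p

XtX = X.T @ X                        # p x p Gram matrix used in Newton steps
print(np.linalg.matrix_rank(XtX))    # 20: singular, so no unique MLE exists
```

Because X'X has rank at most n < p, infinitely many coefficient vectors fit the training labels equally well, which is one source of the unreliable classification results described above.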