Article

Bootstrap confidence intervals for the optimal cutoff point to bisect estimated probabilities from logistic regression

Zheng Zhang$^{1,2}$, Xianjun Shi$^{3}$, Xiaogang Xiang$^{3}$, Chengyong Wang$^{4}$, Shiwu Xiao$^{4}$ and Xiaogang Su$^{2}$

$^{1}$ University of Tennessee, Knoxville, TN, USA
$^{2}$ University of Texas at El Paso, USA
$^{3}$ Wuhan Textile University, Hubei Sheng, China
$^{4}$ Hubei University of Arts & Science, Hubei Sheng, China

Corresponding author: Xianjun Shi, School of Mathematics and Computer Science, Wuhan Textile University, Hubei Province 430073, China. Email: shixjfanh@wtu.edu.cn

Statistical Methods in Medical Research 0(0) 1–13. © The Author(s) 2019. Article reuse guidelines: sagepub.com/journals-permissions. DOI: 10.1177/0962280219864998. journals.sagepub.com/home/smm

Abstract

To classify estimated probabilities from a logistic regression model into two groups (e.g., yes or no, disease or no disease), the optimal cutoff point or threshold is crucial. While various methods have been proposed for estimating such a threshold, statistical inference is not generally available. To tackle this issue, we put forward several bootstrap-based methods, including the conventional nonparametric bootstrap standard errors and the quantile interval. Special emphasis is placed on a more precise bagging estimator of the optimal cutoff point, for which a confidence interval can be obtained via the recently proposed infinitesimal jackknife method. We investigate the empirical performance of the proposed methods by simulation and illustrate their use via the analysis of a fertility data set concerning seminal quality prediction.

Keywords

Classification, logistic regression, optimal cutoff point, receiver operating characteristic curve, Youden index

1 Introduction

Logistic regression is a fundamental modeling tool in biomedical and other application fields. Its wide popularity stems from the sound generalized linear models$^{1}$ (GLM) theory, the meaningful interpretation of the regression parameters via odds ratios, the efficient computation by virtue of a convex optimization formulation, and its great flexibility and capability in terms of addressing various modeling issues such as high-dimensional predictors of categorical and continuous types, interactions, nonlinearity, model diagnostics, and variable selection.

Consider a typical binary classification setting where the available data $D = \{(\mathbf{x}_i, y_i) : i = 1, \ldots, n\}$ consist of $n$ independent and identically distributed (IID) copies of the $p$-dimensional predictor vector $\mathbf{x} \in \mathbb{R}^p$ and the binary response $y \in \{0, 1\}$. With logistic regression, the conditional probability $\pi_i = \pi(\mathbf{x}_i) = \Pr(y_i = 1 \mid \mathbf{x}_i)$ is modeled as a linear combination of $\mathbf{x}_i$ through a logit link function, given by

$$\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i} = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}_1 \qquad (1)$$

where $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1^T)^T$ is the vector of regression coefficients. Estimation of $\boldsymbol{\beta}$ in Model (1) is efficiently done within the maximum likelihood estimation (MLE) framework via Fisher scoring. The output from logistic regression consists of estimated probabilities $\hat{\pi}_i$ given by

$$\hat{\pi}_i = \operatorname{expit}(\hat{\beta}_0 + \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_1) \qquad (2)$$
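As an illustration of Model (1) and equation (2), the following minimal sketch fits a logistic regression by Fisher scoring and then computes the estimated probabilities. It is not the authors' code: the function names `expit`, `fit_logistic_fisher`, and `predict_proba`, the simulated data, the coefficient values, and the stopping rule are all assumptions made for this example.

```python
import numpy as np

def expit(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_fisher(X, y, max_iter=50, tol=1e-8):
    """Fit Model (1) by Fisher scoring (hypothetical helper).

    Returns the stacked estimate (beta_0, beta_1)."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])    # prepend the intercept column
    beta = np.zeros(Z.shape[1])             # start from beta = 0
    for _ in range(max_iter):
        pi = expit(Z @ beta)                # current fitted probabilities
        w = pi * (1.0 - pi)                 # Bernoulli variances
        score = Z.T @ (y - pi)              # gradient of the log-likelihood
        info = Z.T @ (Z * w[:, None])       # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:      # assumed convergence criterion
            break
    return beta

def predict_proba(X, beta):
    """Estimated probabilities as in equation (2)."""
    return expit(beta[0] + X @ beta[1:])

# Simulated data; the sample size and true coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, expit(beta_true[0] + X @ beta_true[1:]))

beta_hat = fit_logistic_fisher(X, y)
pi_hat = predict_proba(X, beta_hat)
```

Because the model includes an intercept, the score equations force the average of the fitted probabilities `pi_hat` to equal the observed proportion of ones in `y`, which gives a quick sanity check on the fit.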