Towards Scalable Quantile Regression Trees

Harish S. Bhat, Applied Mathematics Unit, UC Merced, Merced, USA, hbhat@ucmerced.edu
Nitesh Kumar, Skytree, Inc., San Jose, USA, nitesh@skytree.net
Garnet J. Vaz, Microsoft, Bellevue, USA, gavaz@microsoft.com

Abstract—We provide an algorithm to build quantile regression trees in O(N log N) time, where N is the number of instances in the training set. Quantile regression trees are regression trees that model conditional quantiles of the response variable, rather than the conditional expectation as in standard regression trees. We build quantile regression trees by using the quantile loss function in our node splitting criterion. The performance of our algorithm stems from new online update procedures for both the quantile function and the quantile loss function. We test the quantile tree algorithm in three ways: comparing its running time against implementations of standard regression trees, demonstrating its ability to recover a known set of nonlinear quantile functions, and showing that quantile trees yield smaller test set errors (computed using mean absolute deviation) than standard regression trees. The tests include training sets with up to 16 million instances. Overall, our results enable future use of quantile regression trees for large-scale data mining.

Keywords—regression trees; quantile regression; online algorithm

I. INTRODUCTION

Decision trees are one of the most widely used methods in data mining. Trees enjoy various advantages, among which are interpretability, a natural ability to handle both numerical and categorical predictors, and well-developed methods to deal with missing data. Trees are typically constructed using recursive partitioning algorithms, which can be efficient and scalable. In the present work, we focus on quantile regression trees, which are regression trees designed to model conditional quantiles of the response variable.
This is in contrast to standard regression trees, which model the conditional expectation. Our primary contribution is an algorithm that constructs quantile regression trees in O(N log N) time, where N is the number of instances in the training set.

We are interested in quantile regression trees for two reasons. The first reason has to do with robustness. When tree models are applied to regression problems, the most widely used splitting criterion employs an ordinary least squares (OLS) loss function. To generate trees that are more robust to outliers in the response variable, Breiman et al. [1, Chap. 8] suggested a splitting criterion based on least absolute deviation (LAD). In this approach, when we arrive at a leaf node, the predicted value will be the median of the response variable for the instances associated with the leaf. In this paper, we generalize Breiman's framework to arrive at quantile trees. The splitting criterion uses a tilted absolute value loss function (2) that, in a natural way, allows us to develop a model for the τ-th quantile of the response variable. The LAD tree is included as a special case.

The second reason for our interest in quantile regression trees is a desire for more informative models than one obtains with OLS regression trees. Let X be an N × p matrix of predictors, where each row is a different instance and each column is a different predictor. Let Y, an N × 1 vector, be the corresponding response variable. We regard each row of (X, Y) as an independent sample from the random vector (ξ, η). Let F_{η|ξ}(a|b) = P(η ≤ a | ξ = b) denote the conditional cumulative distribution function (CDF). Then the quantile regression tree with parameter τ is an approximation of the function y = Φ_τ(x) that satisfies F_{η|ξ}(Φ_τ(b)|b) = τ. In short, Φ_τ is the inverse of the conditional CDF, and the quantile tree is a piecewise constant approximation of Φ_τ.
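The tilted absolute value loss referenced above (also called the pinball loss) and its connection to quantile prediction can be illustrated with a short sketch. This is our own illustrative Python, not the paper's implementation; the function name `pinball_loss` and the brute-force grid minimization are assumptions made purely for demonstration. The sketch verifies numerically that the constant minimizing the loss over a sample is the empirical τ-quantile, with τ = 0.5 recovering the median (the LAD special case).

```python
import numpy as np

def pinball_loss(y, c, tau):
    """Tilted absolute value ("pinball") loss of predicting the
    constant c for responses y at quantile level tau:
        rho_tau(u) = u * (tau - 1{u < 0}),  where u = y - c.
    For tau = 0.5 this is half the absolute deviation (LAD)."""
    u = y - c
    return np.sum(u * (tau - (u < 0)))

# The loss is piecewise linear and convex in c, so its minimizer
# lies at a data point; a brute-force search over the sorted sample
# suffices for this small illustration.
rng = np.random.default_rng(0)
y = rng.normal(size=1001)
tau = 0.75
grid = np.sort(y)
losses = [pinball_loss(y, c, tau) for c in grid]
c_star = grid[int(np.argmin(losses))]
# c_star coincides with the empirical 0.75-quantile of y, which is
# why a quantile tree predicts the tau-quantile at each leaf.
```

The brute-force search here costs O(N) per candidate and O(N²) overall; the paper's contribution is precisely to avoid this kind of recomputation via online updates.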
By developing quantile trees for a range of values of τ, we can approximate the conditional CDF of the response given the predictors. OLS regression trees seek to model E[η|ξ]. The difference between quantile and OLS regression trees, therefore, can be understood as the difference between estimating the conditional distribution and estimating the conditional expectation. The additional information in the distribution has proven useful in various problems [2]–[4].

A classic paper on gradient boosting [5] states: "Squared error loss is much more rapidly updated than mean-absolute-deviation when searching for splits during the tree building process." We argue that it is for this reason that, despite the potential advantages outlined above, neither LAD nor quantile regression trees enjoy widespread use on massive data sets. In this paper, we detail an algorithm for updating the quantile loss function ρ_QT that enables quantile regression trees to be built in O(N log N) time;
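The update difficulty quoted above can be made concrete with a naive baseline that, at every candidate split, recomputes each child's quantile and quantile loss from scratch. This is an illustrative Python sketch of that quadratic-cost baseline, not the paper's algorithm; the names `pinball` and `best_split_naive` are our own, and the paper's online updates are precisely what remove the per-split recomputation shown here.

```python
import numpy as np

def pinball(y, tau):
    """Pinball loss of the best constant prediction for responses y,
    i.e., the loss evaluated at the empirical tau-quantile."""
    c = np.quantile(y, tau)
    u = y - c
    return np.sum(u * (tau - (u < 0)))

def best_split_naive(x, y, tau):
    """Naive split search on a single numerical predictor: for each
    candidate threshold, recompute both children's quantiles and
    losses from scratch.  Each candidate costs O(N), so one node
    costs O(N^2) -- the inefficiency the quoted remark refers to.
    Returns (best_loss, best_threshold)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_loss, best_thr = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between equal predictor values
        loss = pinball(ys[:i], tau) + pinball(ys[i:], tau)
        if loss < best_loss:
            best_loss = loss
            best_thr = 0.5 * (xs[i - 1] + xs[i])
    return best_loss, best_thr

# On a clean step function, the search recovers the true breakpoint:
x = np.linspace(0.0, 1.0, 100)
y = (x >= 0.5) * 10.0
loss, thr = best_split_naive(x, y, 0.5)  # thr is 0.5, loss is 0
```

For squared error loss, the per-split statistics (sums and sums of squares) admit O(1) incremental updates, which is why OLS trees scale easily; the quantile and quantile-loss updates developed in the paper play the analogous role for the pinball loss.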