Analysis of Robust Measures in Random Forest Regression

MAJ John R. Brence, Ph.D.
Adjunct Assistant Professor
Department of Systems Engineering
United States Military Academy
West Point, NY 10996
845-304-6416
john.brence@usma.edu

Donald E. Brown, Ph.D.
Department Chair and Professor
Department of Systems and Information Engineering
University of Virginia
Charlottesville, VA 22903
434-924-5393
brown@virginia.edu

ABSTRACT

This paper presents an extensive empirical analysis of a new method, Robust Random Forest Regression (RRFR), which applies robust measures to Random Forest Regression (RFR). The application and analysis of this tree-based method have yet to be addressed and may provide additional insight in modeling complex data. Our approach is based on RFR with two major differences: the introduction of a robust prediction and a robust error statistic. The current methodology uses the node mean for prediction and the mean squared error (MSE) to derive the in-node and overall error. Herein, we introduce and assess the use of the median for prediction and the mean absolute deviation (MAD) to derive the in-node and overall error. Extensive research has shown that the median is a better predictor of the centrality of a distribution in the presence of large or unbounded outliers, because the median inherently ignores these outliers and bases its prediction on the ordered, central value(s) of the data. We have shown that RRFR performs well under extreme conditions, namely with datasets that include unbounded outliers or heteroscedasticity.

KEY WORDS

Random Forest, Outlier, Robust Statistics, Regression, Tree-based Methods

1 INTRODUCTION

The use of robust measures provides an interesting basis for this study. Robust statistics is generally thought of as the statistics of approximate parametric models (Hampel et al., 1986).
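As a small illustration of the contrast drawn in the abstract, the following sketch compares the two node summaries on a single hypothetical node whose responses contain one extreme outlier. The function names and the data are assumptions made for illustration only, and the MAD is taken here as the mean of the absolute deviations about the node median; this is not the RRFR implementation itself.

```python
from statistics import mean, median


def node_summary_mean(y):
    """Conventional node summary: mean prediction and mean squared error."""
    pred = mean(y)
    err = mean([(v - pred) ** 2 for v in y])
    return pred, err


def node_summary_median(y):
    """Robust node summary: median prediction and mean absolute deviation.

    Assumption: deviations are measured about the node median.
    """
    pred = median(y)
    err = mean([abs(v - pred) for v in y])
    return pred, err


# Hypothetical responses falling in one terminal node.
clean = [9.0, 10.0, 10.0, 11.0, 10.0]
# Same node with the last observation replaced by an extreme outlier.
dirty = clean[:-1] + [1000.0]

for name, y in (("clean", clean), ("dirty", dirty)):
    print(name, "mean/MSE:", node_summary_mean(y))
    print(name, "median/MAD:", node_summary_median(y))
```

On the clean node both predictions are 10. With the outlier, the mean prediction moves to 208 while the median prediction stays at 10, and the squared-error statistic grows far more than the absolute-deviation statistic.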
Using robust statistics allows us to explore relatively dirty datasets without resorting to the somewhat archaic practice of removing outliers or strange observations from a dataset prior to modeling. Applying robust measures to nonparametric models allows us to forgo strict adherence to the usual statistical assumptions such as normality and linearity; nonparametric methods, however, still hold firmly to weaker assumptions such as continuity of the distribution or independence (Hampel et al., 1986). In this sense, robust theory gives us the freedom to be creative in our approach and assists in determining the usefulness of applying robust measures to the RFR algorithm. This is extremely valuable because many of the nonparametric algorithms used today are naturally robust in some instances. In this vein, if a