Paper 1967-2018

Prediction and Interpretation for Machine Learning Regression Methods

D. Richard Cutler, Utah State University

ABSTRACT

The last 30 years have seen extraordinary development of new tools for the prediction of numerical and binary responses. Examples include the LASSO and elastic net for regularization and variable selection in regression, quantile regression for heteroscedastic data, and machine learning predictive methods such as classification and regression trees (CART), multivariate adaptive regression splines (MARS), random forests, gradient boosting machines (GBM), and support vector machines (SVM). All these methods are implemented in SAS®, giving the user an amazing toolkit of predictive methods. In fact, the set of available methods is so rich that it raises the question, "When should I use one or a subset of these methods instead of the others?" In this talk I hope to provide a partial answer to this question through the application of several of these methods to several real datasets with numerical and binary response variables.

INTRODUCTION

Over the last 30 years there has been substantial development of regression methodology, both for regularizing estimation in the multiple linear regression model and for carrying out non-linear regression of various kinds. Notable contributions in the area of regularization include the LASSO (Tibshirani 1996), the elastic net (Zou and Hastie 2005), and least angle regression (Efron et al. 2002), which is both a regularization method and a family of algorithms that can be used to efficiently compute LASSO and elastic net estimates of regression coefficients. An early paper on non-linear regression via scatterplot smoothing and the alternating conditional expectations (ACE) algorithm is due to Breiman and Friedman (1985). Hastie and Tibshirani (1986) extended this approach to create generalized additive models (GAM).
An alternative approach to non-linear regression, based on binary partitioning, is regression trees (Breiman et al. 1984). Multivariate adaptive regression splines (MARS) (Friedman 1991) extended generalized linear and generalized additive models in the direction of modeling interactions, and considerable research on tree methods, notably ensembles of trees, led to the development of gradient boosting machines (GBM) (Friedman 2000) and random forests (Breiman 2001). A completely different approach, based on non-linear projections, is support vector machines, the modern development of which is usually credited to Vapnik (1995) and Cortes and Vapnik (1995).

All of the methods listed above, and more, are implemented in SAS and other statistical packages, giving statisticians a very large toolkit for analyzing and understanding data with a continuous (interval-valued) response variable. In SAS, using the LASSO or fitting a regression tree or a random forest is no harder than fitting an ordinary multiple regression with traditional variable selection. The LASSO has rapidly become a "standard" method for variable selection in regression, and all of these methods scale to larger datasets, where there is so much information that statistical significance ceases to be a useful criterion. In this paper I hope to illustrate the use of some of these methods for the analysis of real datasets.
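As a sketch of that last point, the calls below show how compactly some of these fits can be requested in SAS. The dataset (mydata), response (y), and predictors (x1-x20) are hypothetical placeholders, and the available options vary by SAS release and installed products, so treat this as an illustration rather than a recipe:

```sas
/* LASSO via PROC GLMSELECT (SAS/STAT): variables enter and coefficients */
/* shrink along the regularization path; CHOOSE=CV selects the model by  */
/* cross-validated prediction error.                                     */
proc glmselect data=mydata plots=coefficients;
   model y = x1-x20 / selection=lasso(choose=cv stop=none);
run;

/* A single regression tree via PROC HPSPLIT (SAS/STAT) */
proc hpsplit data=mydata;
   model y = x1-x20;
run;

/* A random forest via PROC HPFOREST (SAS Enterprise Miner);             */
/* LEVEL=INTERVAL declares a continuous response and predictors.         */
proc hpforest data=mydata maxtrees=500;
   target y / level=interval;
   input x1-x20 / level=interval;
run;
```

Each fit is a few lines, comparable in effort to PROC REG with a SELECTION= option, which is the point being made above.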