Modeling Hospitalization Outcomes with Random Decision Trees and Bayesian Feature Selection

Thomson Van Nguyen †
Centre for Mathematical Sciences
University of Cambridge
Wilberforce Road
Cambridge CB3 0WA, UK
Email: tn248@cam.ac.uk

Bhubaneswar Mishra †
Courant Institute of Mathematical Sciences
New York University
715 Broadway, 10th Floor
New York, NY 10003, USA
Email: mishra@cs.nyu.edu

Abstract: We propose several serial and highly parallelized approaches to modeling causality between hospitalization and healthcare observations, using data from the Heritage Health Prize competition. Because any set of predictors culled from the raw dataset is very prone to overfitting, we propose feature selection methods that reduce the predictors to the subset that best represents the available data. We then compare the effectiveness of all our approaches, first against a self-designated test subset of the data, and then against the contest data used for evaluation of ranking and prizes. Our best approach, a linear blend of 20 random decision tree models with feature selection, achieves an RMSLE (root mean squared logarithmic error) score of 0.462678. This score is within 0.00552 of the current leading team's score.

I. INTRODUCTION

The Heritage Health Prize is a data-mining competition sponsored by the Heritage Provider Network, a physician network in Southern California, and administered by Kaggle, a company specializing in running data-mining competitions. The goal of the prize is to develop a mathematical model that accurately predicts the number of days a patient will be hospitalized, given three years' worth of anonymized patient data. With such predictions in hand, the hope is that health care providers can develop new care plans and strategies to reach patients before emergencies occur, thereby reducing the number of unnecessary hospitalizations and reducing overall administrative costs.
The United States was slow to adopt electronic health records (EHRs) until 2009, when EHR usage among healthcare providers surged as a result of the Health Information Technology for Economic and Clinical Health (HITECH) Act, which incentivized EHR adoption as part of the economic stimulus package passed by the U.S. Congress. Ultimately, the longitudinal records in these datasets will give insight into the health of large populations across their lifespans, allowing one to infer not only patterns of activity but also the likely causes of those patterns.

In this paper, we introduce the Heritage Health Prize as a data-mining competition and outline various regression approaches. We describe the datasets given in detail and the problem formulation in Section II. In Section III, we outline three different approaches to the problem: an ordinary least squares linear regression with three simple predictors, a decision tree model created through recursive partitioning, and a random ensemble classifier (random forests). The last of these can be run in a highly parallelized environment using GPUs, as briefly described in Section III-C1. Section IV describes Bayesian feature selection as a method to increase model accuracy and reduce overfitting. All of these modeling approaches are evaluated in Section V, and future work is outlined in Section VI.

This work was supported by NYU SEED Grant xxxxxxx.

II. DATA

The data provided by the Heritage Provider Network includes the following for all three years:
• A list of 120k members in the database, sorted by a unique, anonymized MemberID, gender, and age,
• A claims table containing 1.4 million medical claims made by the members, which includes data on the primary diagnosis, physician specialty, Charlson Co-morbidity Index, and anonymized IDs for their primary care physician, vendor (company issuing the bill), and service provider,
• A labs table containing the number of lab tests performed,
• A drug prescription table containing the number of prescriptions filled by members, and
• A table of hospitalization days for members in years 1, 2, and 3, with the goal of predicting hospitalization days in year 4. This table is right-censored at 15, meaning that every value in it lies in [0, 15].

Following a preprocessing approach in consensus with the community of contestants on Kaggle, we formatted the data as a matrix $X_A$ consisting of 78,049 patient rows, one for each patient with claims in year 1 and an observed value for hospitalization in year 2. The columns were individual counts of specialist and general practitioner visits, primary condition groups, co-morbidity index scores, and various composite predictors created from covariate analyses of the count predictors. The outcome vector $y_A$ contained the number of hospitalization days in year 2 for each patient in $X_A$. The count predictors were normalized by a rank-preserving Box-Cox transformation to alleviate heteroscedasticity: