Evolutionary Mining for Multivariate Associations in Large Time-Varying Data Sets: A Healthcare Network Application

Narine Manukyan
University of Vermont (UVM)
Burlington, VT, USA
Narine.Manukyan@uvm.edu

Margaret J. Eppstein
University of Vermont (UVM)
Burlington, VT, USA
Maggie.Eppstein@uvm.edu

Jeffrey D. Horbar
Vermont Oxford Network, UVM
Burlington, VT, USA
Jeffrey.Horbar@uvm.edu

ABSTRACT

We introduce a new method for exploratory analysis of large data sets with time-varying features, where the aim is to automatically discover novel relationships between features (over some time period) that are predictive of any of a number of time-varying outcomes (over some other time period). Using a genetic algorithm, we co-evolve (i) a subset of predictive features, (ii) which attribute will be predicted, (iii) the time period over which to assess the predictive features, and (iv) the time period over which to assess the predicted attribute. After validating the method on 15 synthetic test problems, we used the approach for exploratory analysis of a large healthcare network data set. We discovered a strong association, with 100% sensitivity, between hospital participation in multi-institutional quality improvement collaboratives during or before 2002, and changes in the risk-adjusted rates of mortality and morbidity observed after a 1-2 year lag. The results provide indirect evidence that these quality improvement collaboratives may have had the desired effect of improving health care practices at participating hospitals. The proposed approach is a potentially powerful and general tool for exploratory analysis of a wide range of time-series data sets.

Categories and Subject Descriptors
I.2.6 [Computing Methodologies]: Artificial Intelligence - knowledge acquisition; I.5.m [Computing Methodologies]: Pattern Recognition - Miscellaneous

General Terms
Algorithms, Experimentation, Design

Keywords
Genetic algorithms, exploratory multivariate data analysis, time series.

Copyright is held by the author/owner(s). GECCO’12, July 7-11, 2012, Philadelphia, Pennsylvania, USA. ACM 978-1-4503-1177-9/12/07.

1. INTRODUCTION

The rapid growth of technology has facilitated widespread collection and storage of vast amounts of time-varying data. This data undoubtedly contains a wealth of potentially valuable information regarding relationships between various time-varying features and outcomes. However, the very size of these databases is an impediment to knowledge discovery, creating a need for automated exploratory analysis tools [2]. The challenge of exploratory data analysis is that one should not only identify features that are associated in a potentially non-linear manner, but also determine which outcome(s) those features are associated with and the time-dependent aspects of the association. In this paper we develop a general tool that uses genetic algorithms and classifiers to find novel multivariate associations between features in time-varying data. We first validate the approach using synthetic data and then apply it to a real-world problem of finding potential associations between 18 different patient outcomes over a 10-year period and different hospital collaborations in the Vermont Oxford Network (VON), a worldwide network of neonatal intensive care units designated to promote dissemination of effective healthcare practices. This data set is difficult to analyze because it includes a large number of time-varying patient outcomes as well as several different types of time-varying interactions between member hospitals. It is not clear which interactions (if any) may be associated with which changes in patient outcomes or, if such an association exists, over which time frames the association is strongest.
Genetic algorithms (GAs) are known to be effective for feature selection [1], but we are not aware of any research that performs both feature selection and predicted-attribute selection for time-series data. The unique contribution of this study is to apply a GA for simultaneous selection of features, feature time frames, which attribute to predict, and over what time period to predict it.

2. METHODS

We use a GA to simultaneously estimate four important aspects of multivariate time-series analysis: (i) a subset of features to be used as input to a statistical predictor, (ii) which attribute we can best predict from these features, (iii) a dividing year that partitions the time series, and (iv) a time lag to be added to the dividing year to indicate a possible delay in outcome change. Fitness is determined by measuring how well the values of the selected features before the dividing year can be used to predict changes in the selected attribute between the period before the dividing year and the period after the dividing year + lag. For brevity, we refer to this method as GAMET (Genetic Algorithm for Multivariate Exploration of Time-varying data). For feature selection, we use binary flags that indicate whether a given feature is included in the final feature subset. To evolve the time-series component, we evolve the dividing year and lag, both of which are represented as Gray-coded integers