Evolutionary Mining for Multivariate Associations in Large Time-Varying Data Sets: A Healthcare Network Application

Narine Manukyan
University of Vermont (UVM)
Burlington, VT, USA
Narine.Manukyan@uvm.edu

Margaret J. Eppstein
University of Vermont (UVM)
Burlington, VT, USA
Maggie.Eppstein@uvm.edu

Jeffrey D. Horbar
Vermont Oxford Network, UVM
Burlington, VT, USA
Jeffrey.Horbar@uvm.edu

ABSTRACT
We introduce a new method for exploratory analysis of large data sets with time-varying features, where the aim is to automatically discover novel relationships between features (over some time period) that are predictive of any of a number of time-varying outcomes (over some other time period). Using a genetic algorithm, we co-evolve (i) a subset of predictive features, (ii) which attribute will be predicted, (iii) the time period over which to assess the predictive features, and (iv) the time period over which to assess the predicted attribute. After validating the method on 15 synthetic test problems, we used the approach for exploratory analysis of a large healthcare network data set. We discovered a strong association, with 100% sensitivity, between hospital participation in multi-institutional quality improvement collaboratives during or before 2002, and changes in the risk-adjusted rates of mortality and morbidity observed after a 1-2 year lag. The results provide indirect evidence that these quality improvement collaboratives may have had the desired effect of improving health care practices at participating hospitals. The proposed approach is a potentially powerful and general tool for exploratory analysis of a wide range of time-series data sets.

Categories and Subject Descriptors
I.2.6 [Computing Methodologies]: Artificial Intelligence - knowledge acquisition; I.5.m [Computing Methodologies]: Pattern Recognition - Miscellaneous

General Terms
Algorithms, Experimentation, Design

Keywords
Genetic algorithms, exploratory multivariate data analysis, time series.

Copyright is held by the author/owner(s). GECCO'12, July 7-11, 2012, Philadelphia, Pennsylvania, USA. ACM 978-1-4503-1177-9/12/07.

1. INTRODUCTION
The rapid growth of technology has facilitated widespread collection and storage of vast amounts of time-varying data. This data undoubtedly contains a wealth of potentially valuable information regarding relationships between various time-varying features and outcomes. However, the very size of these databases is an impediment to knowledge discovery, creating a need for automated exploratory analysis tools [2]. The challenge of exploratory data analysis is that one should not only identify features that are associated in a potentially non-linear manner, but also determine which outcome(s) those features are associated with and the time-dependent aspects of the association.

In this paper we develop a general tool that uses genetic algorithms and classifiers to find novel multivariate associations between features in time-varying data. We first validate the approach using synthetic data and then apply it to a real-world problem of finding potential associations between 18 different patient outcomes over a 10-year period and different hospital collaborations in the Vermont Oxford Network (VON), a worldwide network of neonatal intensive care units designated to promote dissemination of effective healthcare practices. This data set is difficult to analyze because it includes a large number of time-varying patient outcomes as well as several different types of time-varying interactions between member hospitals. It is not clear which interactions (if any) may be associated with which changes in patient outcomes or, if such an association exists, over which time frames the association is strongest.
Genetic algorithms (GAs) are known to be effective for feature selection [1], but we are not aware of any research that performs both feature selection and predicted-attribute selection for time-series data. The unique contribution of this study is to apply a GA for simultaneous selection of features, feature time frames, which attribute to predict, and over what time period to predict it.

2. METHODS
We use a GA to simultaneously estimate four important aspects of multivariate time-series analysis: (i) a subset of features to be used as input into a statistical predictor, (ii) which attribute we can best predict from these features, (iii) a dividing year that partitions the time series, and (iv) a time lag to be added to the dividing year to indicate a possible delay in outcome change. Fitness is determined by measuring how well the values of the selected features before the dividing year can be used to predict changes in the selected attribute between the period before the dividing year and the period after the dividing year + lag. For brevity, we refer to this method as GAMET (Genetic Algorithm for Multivariate Exploration of Time-varying data). For feature selection, we use binary flags that indicate whether a given feature is included in the final feature subset. To evolve the time-series component, we evolve the dividing year and lag, both of which are represented as Gray-coded integers
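The encoding described above can be illustrated with a minimal Python sketch. It is not the GAMET implementation: the gene layout (feature flags first, then dividing-year bits, then lag bits), the bit widths, and every name in it (decode, Decoded, dividing_year_offset, and so on) are assumptions made for illustration, and the predicted-attribute gene is left out because its representation is not specified here. The dividing year is treated as an offset from a hypothetical first year of the data.

```python
from dataclasses import dataclass
from typing import List


def binary_to_gray(n: int) -> int:
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing in successively shifted copies."""
    n = g
    shift = g >> 1
    while shift:
        n ^= shift
        shift >>= 1
    return n


def bits_to_int(bits: List[int]) -> int:
    """Interpret a list of 0/1 values as an integer, most significant bit first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def int_to_bits(value: int, width: int) -> List[int]:
    """Write an integer as a fixed-width list of bits, most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


@dataclass
class Decoded:
    features: List[int]        # indices of features whose flag is set
    dividing_year_offset: int  # offset from an assumed first year of the data
    lag: int                   # years added to the dividing year


def decode(chromosome: List[int], n_features: int,
           year_bits: int, lag_bits: int) -> Decoded:
    """Split a bit-string chromosome into its genes (layout is an assumption)."""
    flags = chromosome[:n_features]
    year_end = n_features + year_bits
    year_gray = chromosome[n_features:year_end]
    lag_gray = chromosome[year_end:year_end + lag_bits]
    return Decoded(
        features=[i for i, flag in enumerate(flags) if flag],
        dividing_year_offset=gray_to_binary(bits_to_int(year_gray)),
        lag=gray_to_binary(bits_to_int(lag_gray)),
    )


# Hypothetical individual: 5 feature flags, a 3-bit dividing-year gene
# encoding offset 5, and a 2-bit lag gene encoding a lag of 2.
example = ([1, 0, 1, 1, 0]
           + int_to_bits(binary_to_gray(5), 3)
           + int_to_bits(binary_to_gray(2), 2))
decoded = decode(example, n_features=5, year_bits=3, lag_bits=2)
```

Gray coding is a common choice for integer genes because consecutive integers differ in exactly one bit, so a single bit-flip mutation can move the dividing year or lag to a neighbouring value.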