An Examination of Multivariate Time Series Hashing with Applications to Health Care

David C. Kale,*† Dian Gong,* Zhengping Che,* Yan Liu, Gerard Medioni
Computer Science Department
University of Southern California
Los Angeles, CA 90089
{dkale,diangong,zche}@usc.edu, {yanliu.cs,medioni}@usc.edu

Randall Wetzel, Patrick Ross
Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit
Children's Hospital Los Angeles
Los Angeles, CA 90027
{rwetzel,pross}@chla.usc.edu

Abstract—As large-scale multivariate time series data become increasingly common in application domains such as health care and traffic analysis, researchers are challenged to build efficient tools to analyze them and provide useful insights. Similarity search, a basic operator for many machine learning and data mining algorithms, has been studied extensively, leading to several efficient solutions. However, similarity search for multivariate time series data is intrinsically challenging because (1) there is no conclusive agreement on what constitutes a good similarity metric for multivariate time series data and (2) calculating similarity scores between two time series is often computationally expensive. In this paper, we address this problem by applying a generalized hashing framework, namely kernelized locality-sensitive hashing, to accelerate time series similarity search under a series of representative similarity metrics. Experimental results on three large-scale clinical data sets demonstrate the effectiveness of the proposed approach.

I. INTRODUCTION

Multivariate time series data are becoming ubiquitous and big. Nowhere is this trend more obvious than in health care, with the growing adoption of electronic health record (EHR) systems.
According to a 2009 survey, hospital intensive care units (ICUs) in the United States (US) treated nearly 55,000 patients per day,¹ generating digital health databases containing millions of individual measurements, many of which constitute multivariate time series. Clinicians naturally want to utilize these data in new and innovative ways to aid in the diagnosis and treatment of new patients. An increasingly popular idea is to search these databases to find "patients like mine," i.e., past cases that are similar to the present one [1]. This classic data mining task, known as similarity search, must be both accurate and fast, which depends crucially on two choices: representation and similarity measure. For traditional data types (e.g., structured data, free text, images), the standard approach is to define a set of features that we extract from each object in our database and then apply straightforward measures of similarity (e.g., Euclidean distance) to the resulting feature vectors. This approach has been applied to time series data [2], but designing good features can be difficult and time-consuming.

* These authors contributed equally. † Also affiliated with the VPICU and CHLA.
¹ From the American Hospital Association Hospital Statistics survey conducted in 2009 and published in 2011 by Health Forum, LLC, and the American Hospital Association.

Researchers have shown empirically that the best representation for time series is often the data themselves, combined with specialized similarity measures [3]. The classic example is dynamic time warping (DTW), an extension of Euclidean distance that permits nonlinear warping along the temporal axis in order to find the optimal alignment between two time series [4]. Time series similarity measures range from simple approaches to complex ones based on fitting parametric models [5].
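To make the warping idea concrete, the standard DTW dynamic program can be sketched as follows (a minimal, unoptimized illustration for univariate series, without the lower-bounding and windowing tricks discussed later):

```python
import math

def dtw_distance(x, y):
    """Dynamic time warping distance between two univariate series.

    Classic O(len(x) * len(y)) dynamic program: cost[i][j] holds the
    minimum cumulative squared-difference cost of aligning x[:i] with
    y[:j], where each step may advance one series or both -- this is
    the nonlinear warping along the temporal axis.
    """
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # stretch x
                                 cost[i][j - 1],      # stretch y
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])
```

Note that `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0, even though the series have different lengths: the warping path aligns the repeated 2 in the second series with the single 2 in the first, which plain Euclidean distance cannot do.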
Indeed, there has been an explosion in the number and variety of time series similarity and distance metrics proposed in the literature over the last decade [6][7][8][9].

There are two critical things to observe about these competing similarity metrics, especially if we want to use them to implement fast similarity search for multivariate time series: first, different similarity measures work best for different data and problems; second, the most effective similarity measures are often computationally expensive and ill-suited to large-scale search. The first point was best demonstrated by the thorough empirical evaluation in [9], in which no single metric worked best across all data and problems. Choosing the right similarity measure (much like designing good features) requires experience, intuition, and experimentation. The second point is more nuanced: some approaches can be sped up with a combination of good engineering and heuristics (e.g., DTW [10]). However, these speed-ups do not generalize beyond specific tasks or to other measures that we may want to use.

In this paper, we investigate a general solution that applies to a large class of time series similarities: kernelized hashing. Hashing has been used to build fast search and retrieval over massive databases of text and images [11][12][13]. It uses one or more hash functions to map the input data to a fixed-length representation (typically a binary code), which can then be used as an index in a large-scale storage architecture. Well-constructed hash functions assign similar codes to similar objects, allowing us to store them together and to find them with just a quick lookup.
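A minimal sketch of this idea is random-hyperplane locality-sensitive hashing for cosine similarity, a simpler, unkernelized relative of the kernelized scheme applied in this paper (function names here are illustrative): each bit of the code is the sign of the input's projection onto a random hyperplane, so nearby vectors agree on most bits.

```python
import random

def make_hash(dim, n_bits, seed=0):
    """Build a random-hyperplane LSH function for vectors of length dim.

    Each of the n_bits hash bits is 1 iff the input lies on the positive
    side of a random Gaussian hyperplane; the probability that two
    vectors disagree on a bit grows with the angle between them.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_bits)]
    def code(x):
        return tuple(int(sum(w * v for w, v in zip(p, x)) >= 0)
                     for p in planes)
    return code

code = make_hash(dim=3, n_bits=8)
a = code([1.0, 0.5, -0.3])
b = code([2.0, 1.0, -0.6])   # same direction, scaled: identical code
c = code([-1.0, -0.5, 0.3])  # opposite direction: (almost) every bit flips
```

Because the bits depend only on the sign of each projection, `a == b` holds exactly, while `c` differs from `a` in essentially every bit; storing series under such codes is what makes the "quick lookup" possible.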