HierarchyScan: A Hierarchical Similarity Search Algorithm for Databases of Long Sequences

Chung-Sheng Li, Philip S. Yu, and Vittorio Castelli*
IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598

Abstract

We present a hierarchical algorithm, HierarchyScan, that efficiently locates one-dimensional subsequences within a collection of sequences of arbitrary length. The subsequences identified by HierarchyScan match a given template pattern in a scale- and phase-independent fashion. The idea is to perform correlation between the stored sequences and the template in the transformed domain hierarchically. Only those subsequences whose maximum correlation value is higher than a predefined threshold will be selected. The performance of this approach is compared to that of sequential scanning, and an order-of-magnitude speedup is observed.

1 Introduction

Temporal or spatial-temporal data constitutes a large portion of data stored in computers [1, 2]. Examples of this type of database include: (1) time series such as stock price index, the volume and revenue of product sales, insurance claims, etc.; (2) medical databases such as 1D signals (e.g., EKG), 2D images (e.g., X-rays), and 3D images (e.g., MRI, CT, and PET); (3) multimedia databases which contain audio, image, and video data; (4) multispectral satellite image databases. Searching for similar patterns in a temporal or spatial-temporal database is essential in many data mining operations [4, 5, 9] in order to discover and predict the risk, causality, and trend associated with a specific pattern.
Typical queries for this type of database include identifying companies with similar growth patterns, products with similar selling patterns, stocks with similar price movement, and images with similar weather patterns, geological features, environmental pollutions, or astrophysical patterns. These queries invariably require similarity matches as opposed to exact matches.

Two types of queries are usually necessary in various data mining operations:

• Object-relative similarity query (i.e., range query or similarity query), in which a search is performed on a collection of objects to find the ones that are within a user-defined distance from the queried object.

• All-pair similarity query, in which the objective is to find all pairs of objects that are within a user-specified distance from each other.

In this paper, we shall consider the first type of queries, where the emphasis is on databases with very long sequences.

*This work was funded in part by grant no. NASA/CAN NCC5-101.

Significant progress has been recently made in sequence matching [5, 6, 7, 8]. Two types of similarity queries for temporal data have emerged thus far: whole matching [5], in which the target sequence and the sequences in the database have the same length; and subsequence matching [6], in which the target sequence could be shorter than the sequences in the database and the match can occur at any arbitrary point. The straightforward approach for whole matching is to consider all of the data points of a sequence simultaneously. A fast whole matching method generalizing this idea to sequence matching is proposed in [5], where the similarity between a sequence in the database and a target sequence is measured by the Euclidean distance between the features extracted from these two sequences in the Fourier domain. Extending the above concept, an innovative approach is proposed in [6] to match subsequences by generating the first few Fourier coefficients of all possible subsequences of a given length for each sequence.
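As a rough illustration of the whole-matching idea attributed to [5] above, the following Python sketch answers a range query by comparing the first few Fourier coefficients before computing exact distances. It is not the implementation of [5] or [6]: the function names (`dft_features`, `feature_distance`, `range_query`), the parameters `k` and `eps`, the unitary 1/sqrt(n) scaling, and the absence of any index structure are all assumptions made for illustration.

```python
import cmath
import math


def dft_features(seq, k):
    """Return the first k DFT coefficients of seq, scaled by 1/sqrt(n).

    The 1/sqrt(n) factor (an assumption of this sketch) makes the
    transform unitary, so distances are preserved across domains.
    """
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(seq))
        / math.sqrt(n)
        for f in range(k)
    ]


def feature_distance(a, b):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


def euclidean(a, b):
    """Euclidean distance between two equal-length real sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(database, target, k, eps):
    """Indices of sequences within distance eps of target.

    Hypothetical helper: assumes every sequence in database has the
    same length as target (whole matching).
    """
    target_features = dft_features(target, k)
    hits = []
    for i, seq in enumerate(database):
        # Filter step: cheap comparison on k coefficients only.
        if feature_distance(dft_features(seq, k), target_features) > eps:
            continue
        # Refinement step: exact time-domain distance for survivors.
        if euclidean(seq, target) <= eps:
            hits.append(i)
    return hits
```

With the unitary scaling, Parseval's relation makes the distance over the first `k` coefficients a lower bound on the time-domain Euclidean distance, so the filter step does not discard true matches; sequences that pass it are still checked exactly. In practice the features of the stored sequences would be computed once and stored rather than recomputed per query.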
Fourier transformation is by no means the best method of feature extraction. It is known that the a priori relative importance of the features can be optimally determined from the singular value decomposition (SVD) or the Karhunen-Loeve transformation on the covariance matrix of the collection of the time sequences [11]. A fast heuristic algorithm which approximates this dimensionality reduction process is proposed in [7]. Even with the significant dimensionality reduction resulting from the algorithm proposed in [7] and the compression of the representations of the subsequences in the feature space using the method proposed in [6], generating all subsequences from each time series is still a daunting task for a database of long sequences.

This paper proposes an enhancement of the feature extraction and matching method discussed in [6], called HierarchyScan. HierarchyScan uses the correlation coefficient as an alternative similarity measure between the target sequence and the stored sequence, and performs an adaptive scan on the extracted features of the stored sequences, based on the target sequence. The idea is to select first the subset of features with the greatest

1063-6382/96 $5.00 © 1996 IEEE