Identification of anomalous time series using functional data analysis

Ganesh K. Subramaniam* (1) & R. Varadhan (2)
(1) AT&T Labs
(2) Centre on Ageing and Health, Johns Hopkins University

1 Introduction

Given the widespread use of information technology, massive amounts of data are collected in on-line, real-time environments. Such time-ordered data can be aggregated over an appropriate time interval, yielding a large volume of equally spaced time series data. Examples of such data include retail scanning systems, point-of-sale systems and call detail records generated by a telecommunications switch.

Let us consider a large array of time series $t_{i,j}$, for $i = 1, \ldots, t$ discrete sampling points and $j = 1, \ldots, N$, where $N$ is the number of time series and $t < N$. The anomalies observed in the aggregate time series $\sum_{j=1}^{N} t_{i,j}$ could be due to a few individual time series. In this paper, we develop an exploratory method based on functional data analysis (Ramsay and Silverman, 1997) to identify this small number of anomalous time series. We use simulated time series data that contain some of the characteristics of telecommunications network data.

2 Functional data analysis

Functional data analysis, referred to as "FDA", is concerned with the analysis of information on curves or functions. FDA is a collection of statistical techniques for answering questions like, "What are the main ways in which the curves vary from one time series to another?" What is unique to functional data is the possibility of also using information on the rates of change, or derivatives, of the curves. FDA uses slopes, curvatures and other characteristics that are available because these curves are intrinsically smooth, and this information can be used in many ways. The first step in FDA is to convert the time series data to functional form.
To do this, we use a basis, a set of basic functional building blocks $\phi_k(t)$, which are combined linearly to define actual functions, i.e., $f(t) = a_1 \phi_1(t) + a_2 \phi_2(t) + \cdots + a_k \phi_k(t)$. In our example, we use B-spline basis functions of order 4, which are piecewise polynomial splines. The penalty parameter $\lambda$, which strikes a compromise between fitting the data and keeping the fit smooth, is estimated using generalized cross-validation (GCV). We then compute the first two derivatives with respect to time of the log-transformed time series. We assess the significance of zero crossings of the derivatives using the SiZer approach developed by Chaudhuri and Marron (1999) and test for the significance of features such as bumps and dips. Using the information in the derivatives, we cluster the time series into various categories: stable, monotonic increasing or decreasing, cyclic and erratic. Figure 1 shows derivatives from our example. All computations were carried out using the S-PLUS FDA package from Insightful Corp.
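
The smoothing and differentiation steps described in Section 2 can be illustrated with a minimal Python sketch. It is not the S-PLUS FDA computation used in the paper: it assumes SciPy 1.10 or later and uses `make_smoothing_spline`, which fits a cubic (order 4) smoothing spline and selects the penalty parameter by GCV when `lam=None`. The helper names `smooth_log_series`, `derivatives` and `classify_trend`, the threshold `tol` and the simulated series are all hypothetical and chosen for illustration only. In particular, `classify_trend` is a crude sign-based stand-in for the SiZer significance assessment, and it does not attempt to separate cyclic from erratic series.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # assumes SciPy >= 1.10


def smooth_log_series(t, y):
    """Fit a cubic smoothing spline to log(y), with lambda chosen by GCV."""
    t = np.asarray(t, dtype=float)           # sampling points, strictly increasing
    log_y = np.log(np.asarray(y, dtype=float))  # y must be strictly positive
    # lam=None asks SciPy to select the penalty parameter by
    # generalized cross-validation.
    return make_smoothing_spline(t, log_y, lam=None)


def derivatives(spline, t):
    """Evaluate the first and second derivatives of the fitted curve at t."""
    t = np.asarray(t, dtype=float)
    return spline.derivative(1)(t), spline.derivative(2)(t)


def classify_trend(d1, tol=0.005):
    """Label a series from the sign of its first derivative.

    tol is an illustrative threshold (an assumption), not a significance
    test; the paper uses SiZer for that purpose.
    """
    d1 = np.asarray(d1, dtype=float)
    if np.all(np.abs(d1) <= tol):
        return "stable"
    if np.all(d1 >= -tol):
        return "monotonic increasing"
    if np.all(d1 <= tol):
        return "monotonic decreasing"
    return "non-monotonic"  # cyclic or erratic; not distinguished here


# Simulated series (an assumption): exponential growth with multiplicative noise.
rng = np.random.default_rng(0)
t = np.arange(100.0)
y = np.exp(0.02 * t + rng.normal(scale=0.01, size=t.size))

spline = smooth_log_series(t, y)
d1, d2 = derivatives(spline, t)
print(classify_trend(d1))
```

Applying `smooth_log_series` and `derivatives` to each of the N series in turn gives one pair of derivative curves per series, which could then be labelled or clustered. Working on the logarithmic scale means the first derivative is read as a relative rate of change, so series of very different magnitudes remain comparable.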