IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

Toward Automated Anomaly Identification in Large-scale Systems

Zhiling Lan, Member, IEEE Computer Society, Ziming Zheng, Student Member, IEEE, and Yawei Li, Student Member, IEEE

Abstract—When a system fails to function properly, health-related data are collected for troubleshooting. However, it is challenging to effectively identify anomalies in the voluminous amount of noisy, high-dimensional data. The traditional manual approach is time-consuming, error-prone, and, even worse, not scalable. In this paper, we present an automated mechanism for node-level anomaly identification in large-scale systems. A set of techniques is presented to automatically analyze the collected data: data transformation to construct a uniform data format for analysis, feature extraction to reduce data size, and unsupervised learning to detect the nodes acting differently from others. Moreover, we compare two techniques, principal component analysis (PCA) and independent component analysis (ICA), for feature extraction. We evaluate our prototype implementation by injecting a variety of faults into a production system at NCSA. The results show that our mechanism, in particular the one using ICA-based feature extraction, can effectively identify faulty nodes with high accuracy and low computation overhead.

Index Terms—anomaly identification, large-scale systems, independent component analysis, principal component analysis, outlier detection

I. INTRODUCTION

A. Motivation

It has been widely accepted that failures are ongoing facts of life to be dealt with in large-scale systems. Studies have shown that in production systems, failure rates can exceed 1000 failures per year and, depending on the root cause of the problem, the average failure repair time ranges from a couple of hours to nearly 100 hours [14], [27].
Every hour that a system is unavailable causes an undesirable loss of processing cycles, as well as substantial maintenance cost. When a system fails to function properly, health-related data are collected across the system for troubleshooting. Unfortunately, effectively finding anomalies and their causes in the data has never been as straightforward as one would expect. Traditionally, human operators are responsible for examining the data using their experience and expertise. Such manual processing is time-consuming, error-prone, and, even worse, not scalable. As the size and complexity of computer systems continue to grow, so does the need for automated anomaly identification.

Zhiling Lan is with the Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616. E-mail: lan@iit.edu.
Ziming Zheng is with the Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616. E-mail: zzheng11@iit.edu.
Yawei Li is with Google Inc. This work was performed when the author was a student at Illinois Institute of Technology. E-mail: liyawei@iit.edu.
Manuscript received August 18, 2008; revised Feb. 7, 2009; accepted March 13, 2009.

To address the problem, in this paper we present an automated mechanism for node-level anomaly identification. Unlike fine-grained root cause analysis, which aims to identify the root causes of problems or faults, ours is a coarse-grained problem localization mechanism that focuses on detecting culprit node(s) by automatically examining the health-related data collected across the system. By finding the abnormal nodes, system managers know where to fix the problem, and application users can take relevant actions to avoid or mitigate fault impact on their applications.
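To make the overall idea concrete, the pipeline of feature extraction followed by unsupervised outlier detection across nodes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-node metric data are synthetic, PCA (computed via SVD) stands in for the PCA/ICA feature extraction step, and a simple median-distance rule stands in for the unsupervised detector; all dimensions and thresholds here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated health data: one row of (already uniformly formatted) metrics per node.
n_nodes, n_metrics = 64, 200
X = rng.normal(size=(n_nodes, n_metrics))
X[5] += 4.0  # inject a fault: node 5 drifts away from its peers on every metric

# Feature extraction via PCA: project the centered data onto the top-3
# principal components, reducing 200 raw metrics to 3 features per node.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ Vt[:3].T  # shape: (64 nodes, 3 features)

# Unsupervised outlier detection: flag nodes whose feature vectors lie far
# from the bulk of the nodes (robust median/MAD cutoff, an illustrative rule).
dist = np.linalg.norm(features - np.median(features, axis=0), axis=1)
mad = np.median(np.abs(dist - np.median(dist)))
anomalies = np.where(dist > np.median(dist) + 3 * mad)[0]
print(anomalies)  # the injected faulty node 5 should appear here
```

The design point this illustrates is that the anomaly is defined relative to the other nodes at the same moment, so no model of "normal" node behavior needs to be trained in advance.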
Following the terminology used in the dependability literature [6], a fault, such as a hardware defect or a software flaw, can cause a system node to transition from a normal state to an error state, and continued use of the node in the error state can lead to node failure. Hence, we seek to discover the nodes in error or failed states, which we also call abnormal states in this paper; we regard these nodes as anomalies that require further investigation.

B. Technical Challenges

Finding anomalies is a daunting problem, especially in systems composed of a vast number of nodes. We classify the key challenges into four categories:

Data diversity. Depending on the monitoring tools used, the collected data often have different formats and semantics, making it hard to process them in a uniform way.

Data volume. Due to the size of modern systems, the data collected for analysis are characterized by their huge volume, e.g., on the order of gigabytes per day [14]. Finding anomalies in such a potentially overwhelming amount of data is like finding needles in a haystack.

Data dependency. Most measured data are mixtures of independent signals, and they often contain noise. A naive method that directly compares the measured data for anomalies is generally inaccurate, producing a substantial number of false alarms.

Anomaly characteristics. In large-scale systems, anomaly types are many and complex. Moreover, node behavior changes dynamically during operation as system nodes are assigned to different tasks. Thus, it is extremely difficult to precisely define the normal behavior of system nodes.

C. Paper Contributions

The primary contribution of this paper lies in a collection of techniques to find abnormal nodes (i.e., anomalies) via