International Journal of Computer Applications (0975 – 8887), Volume 67, No. 19, April 2013

Survey on Outlier Detection in Data Mining

Janpreet Singh
M.Tech. Research Scholar
Department of Computer Science and Engineering
Sri Guru Granth Sahib World University (SGGSWU)
Fatehgarh Sahib, Punjab, India

Shruti Aggarwal
Assistant Professor
Department of Computer Science and Engineering
Sri Guru Granth Sahib World University (SGGSWU)
Fatehgarh Sahib, Punjab, India

ABSTRACT
Data mining is used to extract useful information from a collection of databases or data warehouses, and in recent years it has become an important field. This paper surveys data mining and the techniques it uses to extract useful information, such as clustering, and surveys the techniques used to detect outliers. It also presents the techniques that different researchers have used to detect outliers and to present efficient results to the user.

Keywords: Data Mining, Clustering, Outlier, Outlier Detection

1. INTRODUCTION
Data mining is the task of extracting useful knowledge from a collection of databases or data warehouses. Nowadays data is stored in various formats, such as documents, images, audio, video, and scientific data [1]. The data collected from different applications require a proper mechanism for extracting knowledge from large repositories to support better decision making. Knowledge Discovery in Databases (KDD), often called data mining, aims at the discovery of useful information from large collections of data [2].

1.1 Data Preprocessing
Preprocessing is the first step of knowledge discovery. Data drawn from data warehouses and other information repositories are normally prepared through data cleaning, data integration, data selection, and data transformation [3][4], as shown in Fig. 1. The figure shows the KDD process, that is, how knowledge is formed from raw data.

1. Data Cleaning: Noise is removed from the data, for example by removing irrelevant fields, attributes, or variables.
2. Data Integration: Data is collected and combined from multiple heterogeneous sources.
3. Data Selection: Relevant data is selected according to the user's needs.
4. Data Transformation: Data is transformed into a form appropriate for mining. This involves operations such as smoothing and generalization.

Fig. 1 Data Mining Knowledge Discovery Process

1.2 Data Mining Functionalities
Data mining functionalities specify the kinds of patterns to be found in data mining tasks. Data mining can be performed on various types of databases and information repositories. The data mining functionalities include [3]:

1. Concept/Class Description: Characterization and Discrimination
2. Classification and Prediction
3. Cluster Analysis
4. Evolution and Deviation Analysis
5. Outlier Analysis

2. CLUSTERING
Clustering is the process of grouping similar objects together so that the objects in one group differ from the objects in other groups. Clustering is an unsupervised classification technique, which means that it has no prior knowledge of the data or of the expected results before classifying the data [5]. For example, if we want to arrange books on a bookshelf so that they can be retrieved quickly and easily, we can group them so that similar books form one group and the remaining books form another; such grouping is known as clustering. Cluster analysis is used in a number of applications, such as data analysis, image processing, and market analysis [6]. The term clustering is also used by several research communities to describe methods of grouping unlabeled data. Clustering is used to improve the efficiency of the result by organizing the data into groups. To cluster the data therefore means to assign each data object to a cluster that contains similar objects.
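The idea of assigning each object to a group of similar objects can be illustrated with a minimal sketch. The example below uses k-means, a common partitioning method chosen here only for illustration; it is not a method prescribed by this section. The function name kmeans, the choice of the first k points as initial centroids, and the book features are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters with a basic k-means loop."""
    # Assumption: the first k points serve as initial centroids so that the
    # result is deterministic; practical implementations often pick randomly.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids

# Hypothetical data: two numeric features describing each of six books.
books = [(1.0, 1.2), (8.0, 8.5), (1.5, 0.8), (9.0, 8.0), (0.9, 1.1), (8.5, 9.1)]
labels, centroids = kmeans(books, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

No labels are supplied to the function: the two groups emerge only from the distances between the objects, which is what makes the technique unsupervised.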
2.1 Clustering Methods
Clustering is used to classify the data into different clusters. The clustering methods in use today include the following:

A. Hierarchical Clustering Method