International Journal of Computer Applications (0975 – 8887)
Volume 38, No. 6, January 2012

An Effective Genetic Algorithm for Outlier Detection

P. Vishnu Raja
Assistant Professor (SrG)/CSE
Kongu Engineering College
Perundurai, Erode

Dr. V. Murali Bhaskaran
Principal
Pavaai College of Engineering
Pachal, Namakkal

ABSTRACT
The main objective of outlier detection is to find the data that are exceptional compared with the other data in the data set. Detecting such exceptional data is an important issue in many fields, such as fraud detection, intrusion detection and medicine. In this paper we propose an algorithm to detect outliers using a genetic algorithm. The proposed method was exceptionally accurate in identifying the outliers in the datasets that we tested. The result analysis is done on some standard datasets to assess the accuracy of the algorithm.

General Terms
Database, Data storage, Information retrieval.

Keywords
Outliers, Genetic algorithm, Anomalies, Exceptional objects, Optimization.

1. INTRODUCTION
Detecting outliers is an important issue in many common applications such as fraud detection, intrusion detection, medicine, network robustness analysis and so on. Finding the outliers, or rare instances, is often more interesting than identifying the common data of the usual form. An outlier in a dataset is defined informally as an observation that is considerably different from the remainder of the data, as if it were generated by a different mechanism [1][2]. In many data mining applications, identifying outliers or rare events uncovers new, interesting and unexpected knowledge. It has been observed that most of the algorithms developed to detect anomalies are not accurate [2]. They may flag data that are not outliers, which leads to false results. The results thus produced are also not optimized.
In this paper, we propose a generalized genetic algorithm for identifying the exceptional objects in a dataset, which include the outliers. We chose this approach because genetic algorithms are simple and easy to use and are also computationally powerful. Many searching and optimization algorithms are not adaptive [10], in the sense that they generally solve only the given problem, since the algorithm is designed for that problem alone. Genetic algorithms, by contrast, are adaptive and robust in nature [9][10]: they can be applied to any domain and to any type of problem with slight modifications to the representation, the fitness value or the choice of genetic operators, while the behavior of the genetic algorithm remains the same. We therefore chose a genetic algorithm to solve the outlier detection problem. In our approach the outliers are identified based on the fitness value that is generated: objects whose fitness values are lower are considered to be outliers.

The remainder of this paper is organized as follows: Section 2 discusses the related work and the proposed work, Section 3 describes the genetic algorithm in detail, and Section 4 shows the experimental results of the proposed algorithm. Finally, Section 5 concludes the paper with future work.

2. RELATED WORK
No generic approach exists for detecting outliers. Many approaches have been proposed to detect or identify outliers, including density based, distance based, distribution based and clustering based approaches.

In the density based approach, the density of regions of the data is computed, and data in low density regions are identified as outliers. In (Breunig 2000, Papadimitriou 2003), the LOF (Local Outlier Factor) is assigned as an outlier score to any given data point based on the distance from its local neighborhood.

In the distance based approach (Knorr 2000, Angiulli 2005), the outliers are detected by a distance measure on the feature space.
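As an illustration of the distance based idea, the following is a minimal Python sketch of one common formulation, in which a point is flagged when a large fraction of the remaining points lie farther away than a chosen radius. The function name, the parameters `radius` and `min_fraction`, the use of Euclidean distance and the sample data are all assumptions made for illustration; they are not taken from the cited works or from the algorithm proposed in this paper.

```python
import math


def distance_based_outliers(points, radius, min_fraction):
    """Return indices of points that look like distance based outliers.

    A point is flagged when at least `min_fraction` of the other points
    lie farther than `radius` from it (Euclidean distance). Both
    thresholds are illustrative assumptions chosen by the caller.
    """
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that are farther away than the radius.
        far = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) > radius
        )
        # Compare the fraction of far points against the threshold.
        if n > 1 and far / (n - 1) >= min_fraction:
            outliers.append(i)
    return outliers


# Hypothetical toy data: four points close together and one far away.
data = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (8.0, 8.0)]
print(distance_based_outliers(data, radius=2.0, min_fraction=0.9))  # [4]
```

The quadratic loop over all pairs keeps the sketch short; it is not meant to reflect the efficiency of any published method, and the result depends strongly on how the two thresholds are chosen.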
In Ramasamy (2000), outliers are identified by using the k-nearest neighbor method to rank them. The problem with this approach is that it is very difficult to find a particular value in a dataset [3].

The distribution based approach (Rosseeuw 1996) develops statistical methods from the given data and applies a statistical test to determine whether an object belongs to a particular model or not. Objects with low probability under the statistical model are identified as outliers. Because distribution based approaches are univariate in nature, they cannot be applied in a multidimensional data space.

The clustering based approach (Achuna and Rodriguez 2004) identifies outliers as clusters of small size. The advantage of this approach is that it need not be supervised. The hierarchical based approach (Loureiro 2004, Almeida 2006) identifies outliers by using the resultant clusters as an indicator.

Many algorithms have been proposed to identify outliers, but an optimized solution has not been defined. In this paper we propose genetic algorithm based outlier detection to obtain an effective, optimal result.

3. GENETIC ALGORITHM
Compared to other searching algorithms, genetic algorithms are adaptive, heuristic and robust in nature, which implies that they can be applied to problems of any domain with slight modification of the representation, fitness evaluation and the choice of the genetic operators, but the basic operation of the