A Proposed Outliers Identification Algorithm for Categorical Data Sets

Ayman Taha #1, Osman M. Hegazy #2
# Faculty of Computers and Information, Cairo University, Egypt
1 iyman.taha@gmail.com
2 osman.hegazy@gmail.com

Abstract- Outliers are a minority of observations that are inconsistent with the pattern suggested by the majority of observations. Outlier identification algorithms for categorical data sets face many limitations because measuring distance between categorical values is not straightforward. In this paper, we propose a new unsupervised outlier identification method for categorical data sets. In contrast to other outlier identification methods, the proposed method takes into account the number of categories in each categorical variable. Experimental results show that the proposed method performs comparably to other outlier identification methods.

Keywords- Outliers Detection, Categorical Data, Data Mining.

I. INTRODUCTION

Outliers are observations that are highly inconsistent with other observations and arouse suspicion that they were generated by a different mechanism [7]. There are two main viewpoints on the outlier identification process: as a major pre-processing step for data mining, and as a data mining technique in its own right. The first viewpoint holds that outliers, if they exist, can affect measurements on other points. Consequently, it treats outlier identification as an important step to perform before data mining applications. The second viewpoint, however, classifies outlier identification as one of the data mining processes, in which outliers are the most informative observations in the data set.
Outlier identification has several important applications, such as identifying errors and unexpected entries in databases, identifying new topics in text mining applications [11], identifying abnormal locations in the spatial domain [18], identifying abnormal or catastrophic events in time series data, identifying fraudulent credit card use [15] and identifying unauthorized access or intrusion in computer networks [12].

Outlier detection approaches can be classified into three main categories: supervised, unsupervised and semi-supervised. In the supervised approach, labeled samples are used in a training phase to learn the behavior of normal and abnormal points, and then other points are tested. Observations whose behavior is similar to that of abnormal points are labeled as outliers, and all other observations are labeled as normal. The unsupervised approach, in contrast, treats the data as a static distribution, finds the most remote points and highlights them as outliers. It does not require prior knowledge, but it requires all data to be available before processing. The semi-supervised approach learns only the behavior of normal observations in order to define a boundary of inliers, and then labels observations outside this boundary as outliers [11].

In this paper, we focus on unsupervised outlier detection algorithms because they are more practical than supervised learning, especially in real situations where labeled examples may be unavailable or where new fraud patterns that did not appear in the training phase may appear in the testing phase.

Variables can be classified into two types: continuous variables and categorical variables. Continuous variables have an infinite domain of values, such as length, height and depth, while categorical variables have a finite domain of values, such as colors, nationalities and types.
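As a small illustration of what a finite domain means in practice, the following Python sketch collects the distinct categories observed for each variable in a toy data set. The observations and the helper name `category_domains` are hypothetical and are not taken from the paper; the sketch only shows that each categorical variable takes values from a small, enumerable set.

```python
# Hypothetical toy data set in which every variable is categorical.
observations = [
    {"color": "red", "nationality": "Egyptian", "type": "A"},
    {"color": "blue", "nationality": "Egyptian", "type": "B"},
    {"color": "red", "nationality": "French", "type": "A"},
    {"color": "green", "nationality": "Egyptian", "type": "A"},
]


def category_domains(rows):
    """Return, for each variable, the set of distinct categories observed."""
    domains = {}
    for row in rows:
        for variable, value in row.items():
            domains.setdefault(variable, set()).add(value)
    return domains


domains = category_domains(observations)
for variable, values in sorted(domains.items()):
    # Print the variable name, the size of its domain and the categories.
    print(variable, len(values), sorted(values))
```

A continuous variable such as length would not have a comparable enumerable domain, which is why methods built on numeric distance do not carry over directly to data of this kind.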
This work concerns the problem of outlier identification in categorical data sets, where all variables are categorical. Distance between categorical values is not a well-established notion, which has led to categorical data sets being largely neglected in data mining algorithms. Several distance functions have been proposed in the literature to compute the distance between categorical observations. These distance functions make use of the following characteristics of categorical data sets [4], [5]:
• n: Number of observations.
• q: Number of categorical variables.
• c_i: Number of categories in the i-th categorical variable.
• f_k(x): The frequency of category x in the k-th