Journal of Engineering and Applied Sciences 9 (10-12): 372-377, 2014
ISSN: 1816-949X
© Medwell Journals, 2014

A New Fuzzy Clustering by Outliers

Amina Dik¹, Khalid Jebari¹, Abdelaziz Bouroumi¹,² and Aziz Ettouhami¹
¹LCS Laboratory, Faculty of Sciences, Mohammed V-Agdal University, UM5A Rabat, Morocco
²LMI Laboratory, Ben M’sik Faculty of Sciences, University Hassan II Mohammedia (UH2M), Casablanca, Morocco

Corresponding Author: Amina Dik, LCS Laboratory, Faculty of Sciences, Mohammed V-Agdal University, UM5A Rabat, Morocco

Abstract: This study presents a new approach for partitioning data sets affected by outliers. The proposed scheme consists of two main stages. The first stage is a preprocessing technique that detects data values likely to be outliers by introducing the notion of an object’s proximity degree. The second stage is a new procedure based on the Fuzzy C-Means (FCM) algorithm and the concept of outlier clusters. It introduces clusters for outliers in addition to the regular clusters. The proposed algorithm initializes the centers of these outlier clusters with the detected possible outliers. A final and accurate decision about these possible outliers is made during the process. The performance of this approach is illustrated through real and artificial examples.

Key words: Similarity measure, outlier detection, FCM, proximity degree, illustrated

INTRODUCTION

The goal of data clustering is to find a structure in a dataset (Jain, 2010). It aims to organize a set of objects into homogeneous clusters such that objects in the same cluster are more similar to each other than to those belonging to different clusters (Bouroumi et al., 2000). Clustering has been widely applied in many different fields and disciplines.

Several clustering algorithms are proposed in the literature. The most widely used clustering algorithm is Fuzzy C-Means (FCM), originally proposed by Bezdek (1981). FCM has been widely used and adapted (Krishnapuram and Keller, 1993; Bezdek et al., 1999; Hathaway and Bezdek, 2001). However, FCM is sensitive to outliers, which make it difficult for FCM to extract the clusters correctly (Jain, 2010; Jolion and Rosenfeld, 1989). Several methods have been proposed to detect outliers (Dave and Sen, 1997; Dave and Krishnapuram, 1997) and the concept of a noise cluster was introduced (Dave, 1991; Ohashi, 1984). Unfortunately, these methods require some parameters that are not trivial to estimate.

This study presents an approach for identifying possible outliers and for partitioning data sets containing outliers with an adapted FCM algorithm. The proposed approach deals with the outlier problem by introducing two concepts: an object’s degree of proximity and outlier clusters. The first reflects the closeness of an object to the other considered objects. The second means that, instead of considering a single noise cluster containing all outliers as proposed in noise clustering, each outlier is considered as the center of an outlier cluster. The proposed approach offers the possibility of removing such points or keeping them, and the adapted FCM algorithm, called Possible Outliers FCM (POFCM), reduces the influence of outliers on the regular clusters.

Related work: Several clustering algorithms are proposed in the literature. The most widely used clustering algorithm is FCM, originally proposed by Bezdek (1981). Based on fuzzy set theory, this algorithm allows each point to have a degree of belonging to all clusters instead of belonging to a single cluster. It partitions the considered dataset $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$, where $x_i \in \mathbb{R}^p$ represents an object and $x_{ij}$ its jth feature. Similar objects are placed in the same cluster and dissimilar objects belong to different clusters.
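To make the idea of a degree of belonging concrete, the following minimal Python sketch stores memberships as a c × n matrix and converts it to a crisp assignment. It assumes the usual FCM constraint from Bezdek (1981), which this passage does not state explicitly: the memberships of each object lie in [0, 1] and sum to 1 over the clusters. The names `is_fuzzy_partition` and `harden` and the example matrix are hypothetical illustrations, not part of the proposed method.

```python
from typing import List

Matrix = List[List[float]]  # U[i][k]: membership of object k in cluster i


def is_fuzzy_partition(U: Matrix, tol: float = 1e-9) -> bool:
    """Check the assumed constraints: 0 <= u_ik <= 1 and sum_i u_ik = 1 for each k."""
    if not U or not U[0]:
        return False
    n = len(U[0])
    if any(len(row) != n for row in U):
        return False
    if any(u < -tol or u > 1 + tol for row in U for u in row):
        return False
    return all(abs(sum(row[k] for row in U) - 1.0) <= tol for k in range(n))


def harden(U: Matrix) -> List[int]:
    """Assign each object to the cluster in which its membership is largest.

    Ties are resolved in favour of the lowest cluster index.
    """
    c, n = len(U), len(U[0])
    return [max(range(c), key=lambda i: U[i][k]) for k in range(n)]


# Hypothetical example with c = 2 clusters and n = 3 objects.
U = [
    [0.9, 0.5, 0.2],
    [0.1, 0.5, 0.8],
]
```

In this example the second object belongs equally to both clusters, which a crisp partition cannot express; `harden(U)` has to break that tie arbitrarily.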
FCM optimizes an objective function $J_m$ defined by:

$$J_m(U, V; X) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} \, d^{2}(x_k, v_i) \qquad (1)$$

Where:
$m$ $(1 < m < \infty)$ = Weighting exponent used to control the relative contribution of each object vector $x_k$ and the fuzziness degree of the final partition
$u_{ik}$ = Degree to which the object $x_k$ belongs to the ith cluster ($1 \le i \le c$ and $1 \le k \le n$)
$V = (v_1, v_2, \ldots, v_c)$ = c-tuple of prototypes, each prototype characterizes one of the c clusters
$d(x_k, v_i)$ = Distance between the ith prototype and the kth object
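As an illustration of Eq. 1, the sketch below evaluates the objective function for a given membership matrix, set of prototypes and dataset. It assumes that d is the Euclidean distance, which the text above does not specify, and the names `squared_euclidean` and `fcm_objective` as well as the toy data are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_euclidean(x: Vector, v: Vector) -> float:
    """Squared Euclidean distance d^2(x, v); the choice of metric is an assumption."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fcm_objective(
    U: Sequence[Sequence[float]],  # U[i][k]: membership of object k in cluster i
    V: Sequence[Vector],           # V[i]: prototype of cluster i
    X: Sequence[Vector],           # X[k]: object k
    m: float = 2.0,                # weighting exponent, must be > 1
) -> float:
    """Evaluate J_m(U, V; X) = sum_k sum_i u_ik^m * d^2(x_k, v_i)."""
    if m <= 1:
        raise ValueError("the weighting exponent m must be greater than 1")
    return sum(
        (U[i][k] ** m) * squared_euclidean(X[k], V[i])
        for k in range(len(X))
        for i in range(len(V))
    )


# Toy data: three objects on a line, two prototypes at the extremes.
X = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
V = [(0.0, 0.0), (4.0, 0.0)]
U = [
    [1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0],
]
J = fcm_objective(U, V, X, m=2.0)
```

Only the middle object contributes here: 0.5² · 1 + 0.5² · 9 = 2.5. Because every term is weighted by a squared distance, an object lying far from all prototypes can dominate the sum, which is consistent with the sensitivity of FCM to outliers noted in the introduction.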