Journal of Engineering and Applied Sciences 9 (10-12): 372-377, 2014
ISSN: 1816-949X
© Medwell Journals, 2014
Corresponding Author: Amina Dik, LCS Laboratory, Faculty of Sciences, Mohammed V-Agdal University, UM5A Rabat,
Morocco
A New Fuzzy Clustering by Outliers
Amina Dik¹, Khalid Jebari¹, Abdelaziz Bouroumi¹,² and Aziz Ettouhami¹
¹LCS Laboratory, Faculty of Sciences, Mohammed V-Agdal University, UM5A Rabat, Morocco
²LMI Laboratory, Ben M’sik Faculty of Sciences, University Hassan II Mohammedia (UH2M), Casablanca, Morocco
Abstract: This study presents a new approach for partitioning data sets affected by outliers. The proposed
scheme consists of two main stages. The first stage is a preprocessing technique that detects data values
likely to be outliers by introducing the notion of an object’s proximity degree. The second stage is a new
procedure based on the Fuzzy C-Means (FCM) algorithm and the concept of outlier clusters. It introduces
clusters for outliers in addition to the regular clusters. The proposed algorithm initializes the centers of
these outlier clusters with the detected possible outliers. A final, accurate decision about these possible
outliers is made during the process. The performance of this approach is illustrated through real and
artificial examples.
Key words: Similarity measure, outlier detection, FCM, proximity degree, illustrated
INTRODUCTION
The goal of data clustering is to find a structure in a dataset (Jain, 2010). It aims to organize a set of objects
into homogeneous clusters such that objects in the same cluster are more similar to each other than to those
belonging to different clusters (Bouroumi et al., 2000). Clustering has been widely applied in many fields
and disciplines.

Several clustering algorithms have been proposed in the literature. The most widely used is Fuzzy C-Means
(FCM), originally proposed by Bezdek (1981). FCM has been widely used and adapted (Krishnapuram and
Keller, 1993; Bezdek et al., 1999; Hathaway and Bezdek, 2001). However, FCM is sensitive to outliers, which
make it difficult for FCM to extract the clusters correctly (Jain, 2010; Jolion and Rosenfeld, 1989).

Several methods have been proposed to detect outliers (Dave and Sen, 1997; Dave and Krishnapuram, 1997),
and a new concept of noise cluster was introduced (Dave, 1991; Ohashi, 1984). Unfortunately, these methods
require some parameters that are not trivial to estimate.

This study presents an approach for identifying possible outliers and partitioning data sets containing
outliers with an adapted FCM algorithm. The proposed approach deals with the outlier problem by introducing
two concepts: an object’s degree of proximity and outlier clusters. The first reflects the closeness of an
object to the other considered objects. The second means that, instead of considering a single noise cluster
containing all outliers as proposed in noise clustering, each outlier is considered as the center of an outlier
cluster. The proposed approach offers the possibility of removing such points or keeping them, and the
adapted FCM algorithm, called Possible Outliers FCM (POFCM), reduces the influence of outliers on the
regular clusters.

Related work: Several clustering algorithms have been proposed
in the literature. The most widely used clustering
algorithm is FCM originally proposed by Bezdek (1981).
Based on fuzzy set theory, this algorithm allows each
point to have a degree of belonging to all clusters instead
of belonging to one cluster. It partitions the considered
dataset $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$, where $x_i \in \mathbb{R}^p$ represents an
object and $x_{ij}$ its jth feature. Similar objects are in the same
cluster and dissimilar objects belong to different clusters.
FCM optimizes an objective function $J_m$ defined by:

$$J_m(U, V; X) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} \, d^{2}(x_k, v_i) \qquad (1)$$
Where:
$m$ ($1 < m < \infty$) = Weighting exponent used to control the relative contribution of each object vector $x_i$ and the fuzziness degree of the final partition
$u_{ik}$ = Degree to which the object $x_k$ belongs to the ith cluster ($1 \le i \le c$ and $1 \le k \le n$)
$V(v_1, v_2, \ldots, v_c)$ = c-tuple of prototypes, each prototype characterizes one of the c clusters
$d(x_k, v_i)$ = Distance between the ith prototype and the kth object
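
As an illustration of the objective function in Eq. 1, the following minimal Python sketch evaluates $J_m$ for a given set of objects, prototypes and membership degrees. It assumes the squared Euclidean distance for $d^2(x_k, v_i)$, which is a common choice but is not fixed by the definitions above. The function name `fcm_objective` and the toy data are hypothetical and serve only to make the computation concrete.

```python
def fcm_objective(X, V, U, m=2.0):
    """Evaluate J_m(U, V; X) = sum_k sum_i u_ik^m * d^2(x_k, v_i).

    X: list of n objects, each a list of p features
    V: list of c prototypes, each a list of p features
    U: c-by-n nested list, U[i][k] is the degree to which x_k belongs to cluster i
    m: weighting exponent (m > 1)
    Assumption: d is the Euclidean distance.
    """
    total = 0.0
    for i, v in enumerate(V):
        for k, x in enumerate(X):
            # squared Euclidean distance between object x_k and prototype v_i
            d2 = sum((xj - vj) ** 2 for xj, vj in zip(x, v))
            total += (U[i][k] ** m) * d2
    return total


# Toy data: three objects on a line, two prototypes.
X = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
V = [[0.0, 0.0], [10.0, 0.0]]
# The middle object is shared equally between the two clusters.
U = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]

print(fcm_objective(X, V, U, m=2.0))  # 20.5
```

In this example only the shared object contributes to $J_m$: its membership of 0.5 in each cluster is squared and then weighted by its squared distance to each prototype (1 and 81), giving 0.25 + 20.25 = 20.5. The large second term shows how an object far from a prototype can dominate the objective.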