C. San Martin and S.-W. Kim (Eds.): CIARP 2011, LNCS 7042, pp. 638–645, 2011.
© Springer-Verlag Berlin Heidelberg 2011
A New Clustering Algorithm Based on K-Means Using a
Line Segment as Prototype
Juan Carlos Rojas Thomas
Universidad de Atacama, Copiapó, Chile
juancarlos.rojas@uda.cl
Abstract. This project shows the development of a new clustering algorithm,
based on k-means, which faces its problems with clusters of differences
variances. This new algorithm uses a line segment as prototype which captures
the axis that presents the biggest variance of the cluster. The line segment
adjusts iteratively its long and direction as the data are classified. To perform
the classification, a border region that determines approximately the limit on the
cluster is built based on geometric model, which depends on the central line
segment. The data are classified later according to their proximity to the
different border regions. The process is repeated until the parameters of the all
border regions associated with each cluster remain constant.
Keywords: Clustering, Kmeans, Variance, Central Line Segment, Border Region.
1 Introduction
The process of clustering consists on classifying in an unsupervised way a set of
patterns (observations or data) into groups (clusters) [1]. There are many types of
clustering algorithms. One of these is the center based algorithms. Compared with the
others types of clustering algorithms, the center based algorithms are very efficient
with big data bases and with high dimensional data. Usually, these algorithms try to
minimize an objective function, which defines how good is the solution obtained [2].
1.1 K-Means
The k-means is a clustering algorithm which is considered a center based algorithm.
This algorithm tries to find the k partitions that minimize the objective function. The
objective function used by this algorithm is the mean square error [3]. This criterion,
where m
i
corresponds to the mean of the cluster C
i
, n to the total number of objects,
and k to the total number of clusters, is defined as [4]:
= ∈
- =
k
i c x
i
i
m x
n
E
1
2 1
(1)
In general the k-means algorithm performs the classification of the data according
to a measure of distance to certain points considered the centers of the clusters in a