C. San Martin and S.-W. Kim (Eds.): CIARP 2011, LNCS 7042, pp. 638–645, 2011. © Springer-Verlag Berlin Heidelberg 2011 A New Clustering Algorithm Based on K-Means Using a Line Segment as Prototype Juan Carlos Rojas Thomas Universidad de Atacama, Copiapó, Chile juancarlos.rojas@uda.cl Abstract. This project shows the development of a new clustering algorithm, based on k-means, which faces its problems with clusters of differences variances. This new algorithm uses a line segment as prototype which captures the axis that presents the biggest variance of the cluster. The line segment adjusts iteratively its long and direction as the data are classified. To perform the classification, a border region that determines approximately the limit on the cluster is built based on geometric model, which depends on the central line segment. The data are classified later according to their proximity to the different border regions. The process is repeated until the parameters of the all border regions associated with each cluster remain constant. Keywords: Clustering, Kmeans, Variance, Central Line Segment, Border Region. 1 Introduction The process of clustering consists on classifying in an unsupervised way a set of patterns (observations or data) into groups (clusters) [1]. There are many types of clustering algorithms. One of these is the center based algorithms. Compared with the others types of clustering algorithms, the center based algorithms are very efficient with big data bases and with high dimensional data. Usually, these algorithms try to minimize an objective function, which defines how good is the solution obtained [2]. 1.1 K-Means The k-means is a clustering algorithm which is considered a center based algorithm. This algorithm tries to find the k partitions that minimize the objective function. The objective function used by this algorithm is the mean square error [3]. This criterion, where m i corresponds to the mean of the cluster C i , n to the total number of objects, and k to the total number of clusters, is defined as [4]:  = - = k i c x i i m x n E 1 2 1 (1) In general the k-means algorithm performs the classification of the data according to a measure of distance to certain points considered the centers of the clusters in a