Linear Time Algorithms for Clustering Problems in Any Dimensions

Amit Kumar^1, Yogish Sabharwal^2, and Sandeep Sen^3

1 Dept of Comp Sc & Engg, Indian Institute of Technology, New Delhi-110016, India. amitk@cse.iitd.ernet.in
2 IBM India Research Lab, Block-I, IIT Delhi, Hauz Khas, New Delhi-110016, India. ysabharwal@in.ibm.com
3 Dept of Comp Sc & Engg, Indian Institute of Technology, Kharagpur, India. ssen@cse.iitkgp.ernet.in

Abstract. We generalize the k-means algorithm presented by the authors [14] and show that the resulting algorithm can solve a larger class of clustering problems that satisfy certain properties (existence of a random sampling procedure and tightness). We prove these properties for the k-median and the discrete k-means clustering problems, resulting in O(2^{(k/ε)^{O(1)}} dn) time (1 + ε)-approximation algorithms for these problems. These are the first algorithms for these problems that are linear in the size of the input (nd for n points in d dimensions) and independent of the dimension in the exponent, assuming k and ε to be fixed. A key ingredient of the k-median result is a (1 + ε)-approximation algorithm for the 1-median problem which has running time O(2^{(1/ε)^{O(1)}} d). The previous best known algorithm for this problem had linear running time.

1 Introduction

The problem of clustering a group of data items into similar groups is one of the most widely studied problems in computer science. Clustering has applications in a variety of areas, for example, data mining, information retrieval, image processing, and web search ([5, 7, 16, 9]). Given the wide range of applications, many different definitions of clustering exist in the literature ([8, 4]). Most of these definitions begin by defining a notion of distance (similarity) between two data items and then try to form clusters so that data items with small distance between them get clustered together.
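To make concrete the clustering objectives formalized in the next paragraph, the following is a minimal sketch of the two standard cost functions over Euclidean points (the function names are illustrative, not from the paper):

```python
import math

def dist(p, c):
    # Euclidean distance between two points given as coordinate tuples
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def d_to_set(p, K):
    # d(p, K): distance from point p to the nearest center in the set K
    return min(dist(p, c) for c in K)

def k_means_cost(P, K):
    # k-means objective: sum over p in P of d(p, K)^2
    return sum(d_to_set(p, K) ** 2 for p in P)

def k_median_cost(P, K):
    # k-median objective: sum over p in P of d(p, K)
    return sum(d_to_set(p, K) for p in P)
```

For example, with P = [(0,0), (2,0), (10,0)] and K = [(0,0), (10,0)], the k-means cost is 4.0 (only the middle point contributes, at squared distance 2^2) and the k-median cost is 2.0. The approximation algorithms discussed in the paper search for a center set K that (approximately) minimizes these costs.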
Often, clustering problems arise in a geometric setting, i.e., the data items are points in a high-dimensional Euclidean space. In such settings, it is natural to define the distance between two points as the Euclidean distance between them. Two of the most popular definitions of clustering are the k-means clustering problem and the k-median clustering problem. Given a set of points P, the k-means clustering problem seeks to find a set K of k centers such that ∑_{p∈P} d(p, K)^2 is minimized, whereas the k-median clustering problem seeks to find a set K of k centers such that ∑_{p∈P} d(p, K) is minimized. Note that the points in K can be arbitrary points in the Euclidean space. Here d(p, K) refers