Linear Time Algorithms for Clustering Problems in Any Dimensions

Amit Kumar^1, Yogish Sabharwal^2, and Sandeep Sen^3

1 Dept of Comp Sc & Engg, Indian Institute of Technology, New Delhi-110016, India. amitk@cse.iitd.ernet.in
2 IBM India Research Lab, Block-I, IIT Delhi, Hauz Khas, New Delhi-110016, India. ysabharwal@in.ibm.com
3 Dept of Comp Sc & Engg, Indian Institute of Technology, Kharagpur, India. ssen@cse.iitkgp.ernet.in

Abstract. We generalize the k-means algorithm presented by the authors [14] and show that the resulting algorithm can solve a larger class of clustering problems that satisfy certain properties (existence of a random sampling procedure and tightness). We prove these properties for the k-median and the discrete k-means clustering problems, resulting in O(2^{(k/ε)^{O(1)}} dn) time (1 + ε)-approximation algorithms for these problems. These are the first algorithms for these problems that are linear in the size of the input (nd for n points in d dimensions) and independent of the dimension in the exponent, assuming k and ε to be fixed. A key ingredient of the k-median result is a (1 + ε)-approximation algorithm for the 1-median problem which has running time O(2^{(1/ε)^{O(1)}} d). The previous best known algorithm for this problem had linear running time.

1 Introduction

The problem of clustering a group of data items into similar groups is one of the most widely studied problems in computer science. Clustering has applications in a variety of areas, for example, data mining, information retrieval, image processing, and web search ([5, 7, 16, 9]). Given the wide range of applications, many different definitions of clustering exist in the literature ([8, 4]). Most of these definitions begin by defining a notion of distance (similarity) between two data items and then try to form clusters so that data items with small distance between them get clustered together.
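To make concrete the clustering objectives formalized in the next paragraph, the following is a minimal sketch of the two standard cost functions over Euclidean points (the function names are illustrative, not from the paper):

```python
import math

def dist(p, c):
    # Euclidean distance between two points given as coordinate tuples
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def d_to_set(p, K):
    # d(p, K): distance from point p to the nearest center in the set K
    return min(dist(p, c) for c in K)

def k_means_cost(P, K):
    # k-means objective: sum over p in P of d(p, K)^2
    return sum(d_to_set(p, K) ** 2 for p in P)

def k_median_cost(P, K):
    # k-median objective: sum over p in P of d(p, K)
    return sum(d_to_set(p, K) for p in P)
```

For example, with P = [(0,0), (2,0), (10,0)] and K = [(0,0), (10,0)], the k-means cost is 4.0 (only the middle point contributes, at squared distance 2^2) and the k-median cost is 2.0. The approximation algorithms discussed in the paper search for a center set K that (approximately) minimizes these costs.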
Often, clustering problems arise in a geometric setting, i.e., the data items are points in a high-dimensional Euclidean space. In such settings, it is natural to define the distance between two points as the Euclidean distance between them. Two of the most popular definitions of clustering are the k-means clustering problem and the k-median clustering problem. Given a set of points P, the k-means clustering problem seeks to find a set K of k centers such that ∑_{p∈P} d(p, K)^2 is minimized, whereas the k-median clustering problem seeks to find a set K of k centers such that ∑_{p∈P} d(p, K) is minimized. Note that the points in K can be arbitrary points in the Euclidean space. Here d(p, K) refers