Adaptive Sampling for k-Means Clustering

Ankit Aggarwal¹, Amit Deshpande², and Ravi Kannan²

¹ IIT Delhi, zenithankit@gmail.com
² Microsoft Research India, {amitdesh,kannan}@microsoft.com

Abstract. We show that adaptively sampled O(k) centers give a constant factor bi-criteria approximation for the k-means problem, with constant probability. Moreover, these O(k) centers contain a subset of k centers which gives a constant factor approximation and which can be found using the LP-based techniques of Jain and Vazirani [JV01] and Charikar et al. [CGTS02]. Both these algorithms run in effectively O(nkd) time and extend the O(log k)-approximation achieved by the k-means++ algorithm of Arthur and Vassilvitskii [AV07].

1 Introduction

k-means is a popular objective function used for clustering problems in computer vision, machine learning, and computational geometry. Given n data points, the k-means clustering problem asks for a set of k centers that minimizes the sum of squared distances between each point and its nearest center. Formally, the k-means problem asks: given a set X ⊆ ℝᵈ of n data points and an integer k > 0, find a set C ⊆ ℝᵈ of k centers that minimizes the potential function

    φ(C) = ∑_{x∈X} min_{c∈C} ‖x − c‖².

We denote by φ_A(C) = ∑_{x∈A} min_{c∈C} ‖x − c‖² the contribution of the points in a subset A ⊆ X. Let C_OPT be the set of optimal k centers. In the optimal solution, each point of X is assigned to its nearest center in C_OPT. This induces a natural partition of X into disjoint subsets A₁ ∪ A₂ ∪ ··· ∪ A_k.

There is a variant of the k-means problem, known as the discrete k-means problem, where the centers have to be points from X itself. Note that the optima of the k-means problem and its discrete variant are within constant factors of each other. There are other variants where the objective is to minimize the sum of p-th powers of distances instead of squares (for p ≥ 1), or to be more precise,

    ( ∑_{x∈X} min_{c∈C} ‖x − c‖ᵖ )^{1/p}.
The p = 1 case is known as the k-median problem and the p = ∞ case is known as the k-center problem. Moreover, one can also ask the discrete k-means problem over arbitrary metric spaces instead of ℝᵈ.

I. Dinur et al. (Eds.): APPROX and RANDOM 2009, LNCS 5687, pp. 15–28, 2009. © Springer-Verlag Berlin Heidelberg 2009
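The adaptive sampling referred to in the abstract is D²-sampling in the style of k-means++ [AV07]: after an initial uniformly random center, each subsequent center is drawn with probability proportional to a point's current squared distance to its nearest chosen center, and the potential φ measures the quality of the result. A minimal sketch of these two ingredients, assuming points are tuples in ℝᵈ (the function names and the plain-Python style are illustrative, not from the paper):

```python
import random


def potential(X, C):
    """phi(C): sum over x in X of the squared distance to its nearest center in C."""
    return sum(
        min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in C)
        for x in X
    )


def d2_sample(X, t, rng=None):
    """Pick t centers from X by D^2 (adaptive) sampling, k-means++ style."""
    rng = rng or random.Random(0)
    C = [rng.choice(X)]  # first center chosen uniformly at random
    while len(C) < t:
        # Weight each point by its squared distance to the nearest chosen center,
        # then sample the next center with probability proportional to that weight.
        w = [min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in C) for x in X]
        C.append(rng.choices(X, weights=w)[0])
    return C
```

With t = O(k) such centers, the abstract's claim is that φ(C) is within a constant factor of the optimal k-means potential with constant probability, whereas t = k centers give only the O(log k) guarantee of [AV07].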