Boosting Clustering by Active Constraint Selection

Viet-Vu Vu, Nicolas Labroche, and Bernadette Bouchon-Meunier 1

Abstract. In this paper we address the problem of active query selection for clustering with constraints. The objective is to determine automatically a set of user queries that define a set of must-link or cannot-link constraints. Some works on active constraint learning have already been proposed, but they are mainly applied to K-Means-like clustering algorithms, which are known to be limited to spherical clusters, while we are interested in clusters of arbitrary sizes and shapes. The novelty of our approach relies on the use of a k-nearest neighbor graph to determine candidate constraints, coupled with a new constraint utility function. Comparative experiments conducted on real datasets from a machine learning repository show that our approach significantly improves the results of constraint-based clustering algorithms.

1 INTRODUCTION

In recent years, clustering with constraints (also known as clustering with side information) has become a topic of significant interest for many researchers, because these methods can take a user's knowledge (the user is called an oracle or teacher in this case) - expressed as a set of constraints - into account to improve the clustering results. There exist several families of constraints, but the most used are must-link (ML) and cannot-link (CL) constraints [25]. ML constraints indicate that two points of the dataset have to be partitioned into the same cluster, while CL constraints impose that the points belong to different clusters. We can divide previous work on clustering with constraints into two main families: either 1) the constraints help the algorithm to learn a metric/objective function [3, 8, 15, 21, 18, 6, 20], or 2) the constraints are used as hints to guide the algorithm to a useful solution [7, 23, 25, 22]. The motivation of our work focuses on the two open questions that follow: 1.
How can we determine the utility of a given constraint set, prior to clustering [24]? The need for a constraint utility measure has become imperative with the recent observation that some poorly defined constraint sets can decrease clustering performances [9, 24]. We propose a new measure to evaluate constraint utility. This measure evaluates the ability of a constraint to help the clustering algorithm to distinguish the points in perturbation regions, e.g. sparse regions or transition regions. We use this measure to develop an active constraint selection algorithm.

2. How can we minimize the effort required from the user, by only soliciting him or her for the most useful constraints [13, 24]? Much research has been conducted on the problem of clustering with constraints [3, 7, 22, 23, 18, 2, 6, 20, 12], but most of the time the user is supposed to provide the algorithm with good constraints in a passive manner (see Figure 1). One alternative is to let the user actively choose the constraints. However, as some poorly chosen constraints can lead to a bad convergence of the algorithms [9], and as there are possibly n×(n−1)/2 ML or CL constraints in a dataset with n points, the choice of the constraints appears to be a crucial problem. Some works have been proposed on this topic, but they only focus on K-Means clustering [1, 19].

This paper presents a new active constraint selection algorithm to collect a constraint set suitable for constrained clustering algorithms that apply to clusters with different sizes and arbitrary shapes (Constrained-DBSCAN [22], Constrained Hierarchical Clustering [7], and Constrained Spectral Clustering [27]). Our method relies on a k-nearest neighbor graph to estimate sparse regions of the data, where queries about constraints are most likely to be asked.

1 Université Pierre et Marie Curie - Paris 6, CNRS UMR 7606, LIP6, Paris, France, email: {viet-vu.vu, nicolas.labroche, bernadette.bouchon-meunier}@lip6.fr
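To make the sparse-region intuition concrete, here is a minimal sketch, not the paper's actual algorithm: the function name `knn_sparseness`, the brute-force distance computation, and all parameter values are illustrative assumptions. Each point is scored by its mean distance to its k nearest neighbors; high scores flag sparse regions, which are the kind of perturbation regions where constraint queries would be concentrated.

```python
import numpy as np

def knn_sparseness(X, k=3):
    """Score each point by the mean distance to its k nearest neighbors;
    larger scores indicate sparser regions (illustrative heuristic only)."""
    # Brute-force pairwise Euclidean distances (fine for small n).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbor
    knn = np.sort(d, axis=1)[:, :k]    # k smallest distances per point
    return knn.mean(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),   # tight, dense cluster
               rng.normal(5.0, 2.0, (5, 2))])   # scattered, sparse points
scores = knn_sparseness(X, k=3)
candidates = np.argsort(scores)[::-1][:5]       # sparsest points first
print(candidates)
```

In the paper's setting, such candidate points would seed the pairs submitted to the oracle; this per-point score is only a plausible stand-in for the k-nearest neighbor graph construction the authors describe.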
Figure 1. Illustration of passive definition of constraints (top) and active constraint learning (bottom)

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 presents our new framework for active constraint selection, while Section 4 describes the experiments that have been conducted on benchmark datasets. Finally, Section 5 concludes and discusses future research.

2 RELATED WORKS

There are few works on active constraint selection for clustering. In [1], an algorithm for active constraint selection for K-Means using a farthest-first strategy was proposed. This algorithm is referred to as