Improving constrained clustering with active query selection

Viet-Vu Vu a,*, Nicolas Labroche a, Bernadette Bouchon-Meunier b

a UPMC Univ Paris 06, UMR 7606, LIP6, F-75005 Paris, France
b CNRS, UMR 7606, LIP6, F-75005 Paris, France

Article history: Received 14 February 2011; Received in revised form 14 September 2011; Accepted 5 October 2011; Available online 9 November 2011

Keywords: Active semi-supervised clustering; Pairwise constraints; k-Nearest neighbors graph

Abstract

In this article, we address the problem of automatic constraint selection to improve the performance of constraint-based clustering algorithms. To this aim, we propose a novel active learning algorithm that relies on a k-nearest neighbors graph and a new constraint utility function to generate queries to the human expert. This mechanism is paired with propagation and refinement processes that limit the number of constraint candidates and introduce a minimal diversity in the proposed constraints. Existing constraint selection heuristics rely on random selection or on a min–max criterion and are thus either inefficient or better suited to spherical clusters. In contrast, our method is designed to benefit all constraint-based clustering algorithms. Comparative experiments conducted on real datasets with two distinct representative constraint-based clustering algorithms show that our approach significantly improves clustering quality while minimizing the number of human expert solicitations.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, clustering with constraints (also known as clustering with side information) has become a topic of significant interest for many researchers. These methods make it possible to take a user's knowledge, expressed as a set of constraints, into account to improve the clustering results. Several families of constraints exist, but the most widely used are must-link (ML) and cannot-link (CL) constraints [1].
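The semantics of these two constraint types can be made concrete with a short sketch. The representation below (index pairs plus a label list) and the helper name `satisfies` are illustrative choices of ours, not notation from the paper:

```python
# Minimal sketch of must-link / cannot-link semantics: constraints are
# pairs of point indices, and a clustering is a list of cluster labels
# indexed by point. (Data layout and names are illustrative only.)

def satisfies(labels, must_link, cannot_link):
    """Return True iff the assignment respects every pairwise constraint."""
    ml_ok = all(labels[i] == labels[j] for i, j in must_link)   # same cluster
    cl_ok = all(labels[i] != labels[j] for i, j in cannot_link) # different clusters
    return ml_ok and cl_ok

labels = [0, 0, 1, 1]        # cluster label of each of 4 points
must_link = [(0, 1)]         # points 0 and 1 must share a cluster
cannot_link = [(1, 2)]       # points 1 and 2 must be separated

print(satisfies(labels, must_link, cannot_link))        # True
print(satisfies([0, 1, 1, 1], must_link, cannot_link))  # False: ML(0, 1) violated
```

A "feasible" clustering, in the strict-enforcement sense discussed below, is simply one for which such a check returns True.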
On the one hand, ML constraints indicate that two points of the dataset have to be grouped in the same cluster; on the other hand, CL constraints impose that the two points belong to different clusters. Constraints are also used in other domains such as constrained classification [2,3] and feature selection [4].

Previous work on clustering with constraints can be divided into two main families: (1) distance-based methods, in which the constraints help the algorithm to learn a metric/objective function [5–12], and (2) constraint-based methods, in which the constraints are used as hints to guide the algorithm toward a useful solution [1,13–18]. Following [19], given a constraint set, distance-based methods are first trained to "satisfy" the constraints so that, after training, data objects associated by a must-link constraint are close and data objects linked by a cannot-link constraint are well separated in the learning space. Several distance measures have been used for distance-based constrained clustering: string-edit distance trained using EM, Jensen–Shannon divergence trained using gradient descent, Euclidean distance modified by a shortest-path algorithm, and Mahalanobis distance trained using convex optimization. More recent techniques include learning a distance metric transformation that is globally linear but locally non-linear, and learning a margin-based clustering distortion measure using boosting.

In constraint-based approaches, two families of methods can be found: on the one hand, algorithms with strict enforcement, which find the best feasible clustering respecting all the constraints, and, on the other hand, algorithms with partial enforcement, which find the best clustering while maximally respecting the constraints.
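Partial-enforcement methods typically trade off clustering cost against constraint violations. The sketch below illustrates the general shape of such a penalized objective with a k-means-style cost; the function name, the uniform violation weight `w`, and the toy data are our illustrative assumptions, not a specific algorithm from the literature cited here:

```python
# Sketch of a partial-enforcement objective: clustering cost (sum of
# squared distances to assigned centers) plus a fixed penalty w for
# each violated ML or CL constraint. (Illustrative, not a specific
# published algorithm.)

def penalized_objective(points, labels, centers, must_link, cannot_link, w=1.0):
    # Clustering cost: squared distance of each point to its center.
    sse = sum(
        sum((x - c) ** 2 for x, c in zip(points[i], centers[labels[i]]))
        for i in range(len(points))
    )
    # Penalty: w per violated constraint.
    violations = sum(labels[i] != labels[j] for i, j in must_link)
    violations += sum(labels[i] == labels[j] for i, j in cannot_link)
    return sse + w * violations

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
centers = [(0.05, 0.0), (1.0, 1.0)]
good = penalized_objective(points, [0, 0, 1], centers, [(0, 1)], [(0, 2)], w=10.0)
bad = penalized_objective(points, [0, 1, 1], centers, [(0, 1)], [(0, 2)], w=10.0)
print(good < bad)  # True: violating ML(0, 1) incurs the penalty
```

Strict enforcement corresponds to the limit where any violation makes a solution inadmissible, whereas partial enforcement lets a sufficiently large gain in clustering cost outweigh the penalty term.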
To this aim, several techniques have been proposed so far in the literature: modifying the clustering objective function so that it includes a term of constraint satisfiability, enforcing all constraints to be satisfied during the assignment step of the clustering process, or initializing clusters and inferring clustering constraints from neighborhoods derived from labeled examples [19]. The motivation of our work focuses on the two open questions that follow:

1. How can we determine the utility of a given constraint set, prior to clustering [20]? The need for a constraint utility measure has become imperative with the recent observation that some poorly defined constraint sets can decrease clustering performance [20–22]. In this article, we define a set of desirable properties for such a utility measure and we propose a first implementation based on these properties. Our measure evaluates the ability of a constraint

doi:10.1016/j.patcog.2011.10.016

* Correspondence to: LIP6 - Université Pierre et Marie Curie - Paris 6, UMR CNRS 7606, 4 place Jussieu, case 169, 75252 Paris cedex 05, France. Tel.: +33 1 44 27 88 87; fax: +33 1 44 27 70 00. E-mail addresses: viet-vu.vu@lip6.fr (V.-V. Vu), nicolas.labroche@lip6.fr (N. Labroche), bernadette.bouchon-meunier@lip6.fr (B. Bouchon-Meunier).

Pattern Recognition 45 (2012) 1749–1758