A Cluster-Oriented Genetic Algorithm for Alternative Clustering

Duy Tin Truong, Roberto Battiti
University of Trento, Italy

Abstract—Supervised alternative clustering is the problem of finding a set of clusterings which are of high quality and different from a given negative clustering. The task is therefore a clear multi-objective optimization problem. Optimizing two conflicting objectives requires dealing with trade-offs. Most approaches in the literature optimize these objectives sequentially or indirectly, resulting in solutions which are dominated. We develop a multi-objective algorithm, called COGNAC, which optimizes the objectives directly and simultaneously and produces solutions approximating the Pareto front. COGNAC performs the recombination operator at the cluster level instead of the object level as in traditional genetic algorithms. It can accept arbitrary clustering quality and dissimilarity objectives and provides solutions dominating those of other state-of-the-art algorithms. COGNAC can also be used to generate a sequence of alternative clusterings, each of which is guaranteed to be different from all previous ones.

Keywords-alternative clustering; multi-objective optimization; cluster-oriented; genetic algorithm.

I. INTRODUCTION

Given a dataset, traditional clustering algorithms often provide a single set of clusters, that is, a single view of that dataset. On complex tasks, several interesting ways of grouping items can exist, so it is natural to ask for alternative clusterings that give complementary views. Many techniques have been developed for solving the alternative clustering problem. In unsupervised alternative clustering, the algorithm automatically generates a set of clusterings which are of high quality and different from each other. In supervised alternative clustering, the algorithm allows users to direct the search by explicitly labelling some clusterings as undesired or negative.
This is useful when users already know some trivial or negative clusterings of the dataset and ask for different, and potentially more informative, clusterings.

This paper focuses on supervised alternative clustering: a multi-objective optimization problem (MOP) with two objectives, clustering quality and dissimilarity. The goal is to find a representative set of Pareto-optimal solutions. In MOP, a solution is termed Pareto-optimal if no other solution improves at least one objective without worsening any of the other objectives. The Pareto front is the set of all Pareto-optimal solutions in the objective space. Most approaches in the literature optimize these two objectives only sequentially (optimizing one objective first and then the other) [1], [2] or indirectly through heuristics [3]. Other methods combine the two objectives into a single one and then optimize this single objective [4]. Solving a multi-objective optimization problem in these ways can result in solutions which are not Pareto-optimal, in a single solution, or in a very limited number of solutions on the Pareto front. User flexibility is thus limited because the trade-off between the objectives is decided a priori, before the possible range of solutions is known. The trade-off can be decided in a better way a posteriori, by generating a large set of representative solutions along the Pareto front and then letting the user pick a preferred one among them.

To deal with the above issues, we propose an explicit multi-objective algorithm, called Cluster-Oriented GeNetic algorithm for Alternative Clusterings (COGNAC), capable of (i) optimizing the clustering quality and dissimilarity directly and simultaneously, and (ii) generating a sequence of alternative clusterings, each of which is different from the previous ones.

The rest of this paper is organized as follows.
We describe our algorithm COGNAC in Section II, show how it can generate a sequence of different alternative clusterings, and compare its performance with that of other state-of-the-art algorithms in Section III.

Related Work. Bae et al. [3] propose an alternative clustering algorithm, called COALA, which extends the traditional agglomerative hierarchical clustering algorithm by also considering cannot-link constraints (generated from negative clusterings) when merging the two nearest clusters. As this algorithm considers only cannot-link constraints, useful information obtained through must-link constraints is lost. In addition, the application scope of the method is limited to agglomerative clustering algorithms. To overcome the scope limitation, Davidson et al. [1] propose a method, called AFDT, which transforms the dataset into a different space where the negative clustering is difficult to detect, and then uses an arbitrary clustering algorithm to partition the transformed dataset. However, such a transformation can destroy the dataset's characteristics. Qi et al. [2] fix this problem by finding a transformation which minimizes the Kullback-Leibler divergence between the probability distribution of the dataset in the original space and in the transformed space, under the constraint that the negative clustering should not be detected. We refer to this algorithm as AFDT2. These algorithms can accept only one negative clustering. Nguyen et al. [4] propose MinCEntropy++, which can accept a set of N_C negative clusterings. MinCEntropy++ finds an alternative cluster-