Learning Similarities by Accumulating Evidence in a Probabilistic Way

Helena Aidos and Ana Fred

Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
{haidos,afred}@lx.it.pt

Abstract. Clustering ensembles take advantage of the diversity produced by multiple clustering algorithms to produce a consensual partition. Evidence accumulation clustering (EAC) combines the output of a clustering ensemble into a co-association similarity matrix, which contains the co-occurrences between pairs of objects in a cluster. A consensus partition is then obtained by applying a clustering technique over this matrix. We propose a new combination matrix, where the co-occurrences between objects are modeled in a probabilistic way. We evaluate the proposed methodology using the dissimilarity increments distribution model. This distribution is based on a high-order dissimilarity measure, which uses triplets of nearest neighbors to identify sparse and oddly shaped clusters. Experimental results show that the proposed algorithm produces better and more robust results than EAC on both synthetic and real datasets.

Keywords: Clustering ensembles, co-association matrix, voting scheme, probabilistic learning of similarities, dissimilarity increments distribution.

1 Introduction

Many clustering algorithms have been developed, each producing a different partition for a given dataset, and typically relying on a similarity measure between objects, which can be difficult to choose when no prior knowledge about cluster shapes and structure is available. Furthermore, a single clustering algorithm with a given similarity measure can also produce different solutions for the same dataset, depending on the initialization or parameter values (e.g., k-means). To exploit this diversity, an approach called clustering ensemble (CE) has been developed [13,10,3], producing a set of data partitions.
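The evidence accumulation step described above can be sketched as follows. This is an illustrative implementation, not the authors' code: given a clustering ensemble represented as a list of label vectors, it accumulates pairwise co-occurrences into the co-association matrix, where entry (i, j) is the fraction of ensemble partitions that place objects i and j in the same cluster.

```python
import numpy as np

def co_association(partitions):
    """Build the EAC co-association matrix from a clustering ensemble.

    partitions: list of integer label arrays, each of length n,
                one array per partition in the ensemble.
    Returns an (n, n) matrix of co-occurrence frequencies in [0, 1].
    """
    n = len(partitions[0])
    C = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        # 1 where objects i and j share a cluster in this partition
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(partitions)

# Toy ensemble of two partitions over three objects
ensemble = [[0, 0, 1], [0, 1, 1]]
C = co_association(ensemble)
```

In this toy example, objects 0 and 1 co-occur in one of the two partitions, so C[0, 1] = 0.5, while objects 0 and 2 never co-occur, so C[0, 2] = 0. A consensus partition is then extracted by clustering this matrix, e.g., with a hierarchical method.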
These methods combine the information given by the set of data partitions produced and propose a consensus partition. Moreover, it has been shown that CE methods uncover a more robust and stable cluster structure than a single clustering algorithm [6,13]. To combine information from the set of data partitions, different paradigms have been followed: (i) similarity between objects, induced by the clustering ensemble [6,13,7]; (ii) similarity between partitions [4,3]; (iii) combining similarity between objects and partitions [5]; (iv) probabilistic approaches to CEs [14,15].

E. Bayro-Corrochano and E. Hancock (Eds.): CIARP 2014, LNCS 8827, pp. 596–603, 2014.
© Springer International Publishing Switzerland 2014