Consensus clustering from heterogeneous measures of S. lycopersicum

Milton Pividori 1,2, Georgina Stegmayer 1,2, Fernando Carrari 3, Diego Milone 2

1 CIDISI (CONICET), Universidad Tecnológica Nacional, Facultad Regional Santa Fe, Santa Fe, Argentina
2 sinc(i) (CONICET), Universidad Nacional del Litoral, Facultad de Ingeniería y Ciencias Hídricas, Santa Fe, Argentina
3 INTA-Institute of Biotechnology, Castelar, Argentina

Background

Clustering methods are key tools for understanding the structure of biological datasets. However, correctly choosing a clustering algorithm requires prior knowledge of the data distributions the algorithm assumes and of how its parameter settings affect the final results. Moreover, biological data are often composed of heterogeneous measures of the same objects, such as gene expression, metabolite profiles and other phenotypic measures. In recent years, consensus clustering has emerged as an attempt to solve these problems by combining a set of different clustering solutions (or partitions of the data) into a single consolidated one [1]. Plain clustering algorithms require all the diverse data sources to be integrated first, which involves complex preprocessing of the data (such as normalization). Consensus clustering, instead, provides a simpler, high-quality alternative: each measure type is clustered separately in a first step, and a consensus function then combines the multiple solutions into a consensus partition, maximizing the information used from all data sources and providing better results. Several works have studied this approach and proposed different consensus functions [2–4].

Proposal

In this work, a new method for consensus clustering based on groups of solutions is proposed. It consists of two steps. In the first one, for each data source, clustering solutions are generated using the k-means algorithm (KM) while varying k, the number of clusters.
For each k value, KM is run several times with random initializations, producing a group of solutions and thus extracting all the information from each KM configuration. This means that solutions within a group have the same number of clusters, and that all groups have the same size. In the second step, this heterogeneous information is combined into a single clustering solution by using a supra-consensus function [1], which tries to use as much information from each source as possible. The proposed approach, named group consensus KM (gcKM), provides an extensive analysis of each data source: a wide range of different KM solutions is generated over the same data, yielding different points of view on each source. The method also avoids the need for data preprocessing (normalization) and obtains better results than plain KM (pKM).

Materials and methods

Metabolite profiles, antioxidant capacity, sensory panels and volatile profiles measured in fruits and leaves from 8 different Solanum lycopersicum (tomato) accessions collected along Andean valleys of Argentina [5] were used as data sources. Data were generated in three independent replicates, giving a total of 24 objects. For the gcKM method, each data source was analyzed separately by running KM over it. The number of clusters was varied from 2 to 10, yielding 9 groups of solutions. As the group size was set to 20, a total of 180 clustering solutions was generated for each data source. After that, to obtain a final result, the gcKM method combined all the clustering solutions from each data source into a consensus solution. Note that, at this step, the consensus function does not access the raw data, only the clustering solutions given by the groups of KM partitions. The proposed method (gcKM) was compared with a classical approach (pKM), which consisted of running KM over the complete normalized dataset.
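The first step of gcKM can be sketched in Python. This is a minimal illustration only: the toy k-means, the function names (`kmeans`, `solution_groups`) and the parameter defaults are assumptions for the sketch, not the authors' implementation.

```python
import random

def kmeans(points, k, max_iter=100, rng=random):
    """Minimal k-means (Lloyd's algorithm) with random initialization.
    `points` is a list of equal-length tuples; returns one label per point.
    Illustrative toy implementation, not the authors' code."""
    centers = rng.sample(points, k)  # random initialization from the data
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        new_labels = [min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(x, centers[c])))
                      for x in points]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # update step: recompute each non-empty cluster's center as the mean
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels

def solution_groups(points, k_range=range(2, 11), group_size=20, seed=0):
    """Step 1 of gcKM for a single data source: for each k, run KM
    `group_size` times with random initializations, producing one group of
    solutions per k value."""
    rng = random.Random(seed)
    return {k: [kmeans(points, k, rng=rng) for _ in range(group_size)]
            for k in k_range}
```

With k ranging from 2 to 10 and a group size of 20, this yields the 9 groups and 180 solutions per data source described above.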
To assess the quality of both approaches, two aspects were evaluated: (1) whether accession replicates are clustered together in solutions with 8 clusters (the same number as tomato accessions); and (2) the amount of information that the final solutions of pKM and gcKM use from each data source. For the second measure, a set of representative partitions is first derived from each data source. Once these representatives are obtained, the mutual information between them and each final result is computed by means of the Average Normalized Mutual Information (ANMI) [1]:

Ῡ(Λ, Π′) = (1/M) Σ_{i=1}^{M} Υ(Π′, Π_i),

where M is the number of groups of solutions for each data source, and Υ is the Normalized Mutual Information (NMI) between the final partition Π′ and each representative partition Π_i.

Results

For the first quality assessment, gcKM always clustered the 3 replicates of each accession together, whereas pKM failed in 70% of the cases. The results regarding the amount of information used from each data source in the final clustering (for both gcKM and pKM) are shown in Figure 1. The tops of the bars indicate the ANMI mean over 100 repetitions, together with a 95% confidence interval; the x-axis shows the number of clusters used in the final solution. For example, final solutions with 4 clusters (k = 4) obtained by gcKM reach an average ANMI of 0.77 when compared against the sensory panels (S. Pan.) source, whereas pKM obtains 0.57. That is, gcKM retains more information from this source than pKM does. The analysis must also take into account the relationship between solutions with the same number of clusters across all data sources: if a clustering solution uses more information from one data source, it may use less from another.
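The ANMI measure above can be computed directly from the labelings. A pure-Python sketch follows, assuming the sqrt-entropy normalization of NMI from [1]; the function names `nmi` and `anmi` are illustrative:

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized Mutual Information Υ between two labelings of the same
    objects, normalized by sqrt(H(a) * H(b))."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    cab = Counter(zip(a, b))
    # mutual information I(a; b) from joint and marginal label counts
    mi = sum((nij / n) * math.log(nij * n / (ca[i] * cb[j]))
             for (i, j), nij in cab.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:  # degenerate single-cluster partition
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)

def anmi(final, representatives):
    """Average NMI Ῡ(Λ, Π′) between a final partition Π′ and the M
    representative partitions Π_i of one data source."""
    return sum(nmi(final, r) for r in representatives) / len(representatives)
```

NMI is invariant to label permutations (two identical partitions with swapped label names still score 1), which is what makes it suitable for comparing partitions produced by independent KM runs.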
Thus, it is interesting to assess, in addition to the highest ANMI values, how balanced the ANMI obtained by the gcKM and pKM solutions is among all data sources. Although pKM obtains a higher ANMI under certain configurations and sources, gcKM