Comparing, Contrasting and Combining Clusters in Viral Gene Expression Data

Paul Kellam a, Xiaohui Liu b, Nigel Martin c, Christine Orengo d, Stephen Swift b, Allan Tucker b

a Department of Immunology and Molecular Pathology, University College London, Gower Street, London, WC1E 6BT, UK
b Department of Information Systems and Computing, Brunel University, Uxbridge, Middlesex, UB8 3PH, UK
c School of Computer Science and Information Systems, Birkbeck College, Malet Street, London, WC1E 7HX, UK
d Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK

Abstract

Massively high-dimensional datasets are fast becoming commonplace, and any advance in the reliable partitioning of this kind of data into smaller, relatively independent clusters would be highly beneficial. Nowhere is this more true than in functional genomics, where gene expression data can contain thousands of variables. Methods to divide the data into manageable portions are urgently needed. This would pave the way for progress in obtaining models of biological processes for explanation and prediction, and ultimately lead to a great improvement in the quality of human life. In this paper we contrast and compare several different clustering techniques from both the statistical and Artificial Intelligence communities in the context of viral gene expression data. We introduce a method, Clusterfusion, which takes the results of different clustering algorithms and generates a set of robust clusters based upon the consensus of the results of each algorithm.

Keywords: Clustering; Gene Expression; Weighted-Kappa

1. Introduction

There are many practical applications involving the partition of a set of objects into a number of mutually exclusive subsets. The field of research associated with trying to achieve this based on distance metrics or correlations is known as clustering.
Any algorithm that applies a global search for the optimal clusters will run in time exponential in the size of the problem space, and so a heuristic or approximate procedure is normally required to cope with most real-world problems. This is especially true in the field of functional genomics, where gene expression data can contain thousands of variables. Methods to divide the data into manageable portions are urgently needed. This would pave the way for progress in obtaining models of biological processes for explanation and prediction, and ultimately lead to a great improvement in the quality of human life.

Much research has been done on clustering and there are many different heuristic algorithms available, perhaps the most common being k-means and hierarchical clustering [8,13]. Nearly all of these algorithms make use of a starting allocation of variables (for example, based upon random points in the dataspace or upon the most correlated variables), and therefore contain bias in their search. They are also prone to getting stuck in local maxima during the search. These methods have been used for partitioning gene expression data [3,6] with some success. However, there is little comparison of the different clustering methods used on expression data. There has also been research into the use of evolutionary algorithms to solve the grouping problem, resulting in a more general partitioning method which can be applied to clustering [4]. This method aims to overcome the biases and local maxima involved with heuristic searches, but requires fine tuning of parameters.

In Section 2 we describe the four clustering techniques investigated in this paper (including an evolutionary approach that we previously developed for clustering time-series variables [11,12]). In Section 3 we contrast and compare these techniques and introduce a method, called Clusterfusion, to generate robust clusters.
This allows us to infer with more confidence that certain variables belong within the same cluster. Clusterfusion takes the results of the clustering algorithms and generates a set of robust clusters based upon the consensus of the different results of each algorithm. This method has been tested on viral gene expression data with promising results, as discussed in Section 4.

2. Background

The four clustering techniques we investigate are based on the correlation between variables. Within this paper we use a well-established correlation coefficient, Pearson's correlation coefficient, r [10]. Pearson's correlation coefficient measures the linear relationship between two variables, which may be either discrete or continuous, and is defined in equation 1.
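As an informal illustration of how r behaves, the following is a minimal Python sketch of the standard textbook form of Pearson's coefficient. It is not taken from the paper: the function name `pearson_r` and the toy expression profiles are assumptions made up purely for illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length series.

    Hypothetical helper for illustration only; uses the standard
    textbook definition (covariance divided by the product of the
    standard deviations).
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Deviations of each observation from its series mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]

    cross = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)

    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a series is constant")

    return cross / math.sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Toy, made-up expression profiles over four time points.
    gene_a = [1.0, 2.0, 3.0, 4.0]
    gene_b = [2.0, 4.0, 6.0, 8.0]   # rises in step with gene_a
    gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

    print(pearson_r(gene_a, gene_b))
    print(pearson_r(gene_a, gene_c))

    # A purely quadratic relationship: strongly dependent, yet not linear.
    x = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(pearson_r(x, [v * v for v in x]))
```

The last case is a reminder that r captures only linear association: two variables can be strongly dependent and still have a coefficient near zero, which matters when correlation is used as the basis for grouping variables.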