Jackknife distances for clustering time–course gene expression data

Theresa Scharl 1, Friedrich Leisch 2

1 Department of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, Austria; Theresa.Scharl@ci.tuwien.ac.at
2 Department of Statistics, University of Munich, Ludwigstraße 33, D-80539 München, Germany; Friedrich.Leisch@stat.uni-muenchen.de

Abstract

Clustering time–course gene expression data is a common tool to find co–regulated genes and groups of genes with similar temporal or spatial expression patterns. The distance measure used for clustering has a major impact on the properties of the resulting clusters. As technical problems can easily distort the microarray data, there is a need for distance measures which are able to deal with outliers. Here we present new so-called "Jackknife" distance measures which can handle outlier time points. The utility of such distance measures is investigated in a simulation study on a publicly available dataset from yeast.

Keywords: Cluster Analysis, Time–course Microarray Data, Distance Measures, R.

1 Introduction

The interpretation of enormous amounts of data from microarrays has been a challenging task in statistics and bioinformatics for the past few years. One possible approach to deal with the complexity of the data is cluster analysis, which has been widely applied, for example, for grouping tissue samples in cancer research (e.g. Pollard and van der Laan, 2005; Thomas et al., 2001) or for clustering time–course gene expression data. Time–course microarray experiments make it possible to look at the gene expression of thousands of genes at several time points simultaneously. Genes with similar expression patterns are co–expressed genes which are likely to be co–regulated. Hence clustering gene expression patterns may help to find groups of co–regulated genes or to identify common temporal or spatial expression patterns.
Finally, cluster results can suggest functional pathways and interactions between genes (Eisen et al., 1998; Tavazoie et al., 1999; Ben–Dor et al., 1999).

The distance measure used has a major impact on the resulting clusters (Gentleman et al., 2005). The properties of different distance measures have to be investigated to be able to answer biological questions more precisely. A comparison of different distance measures which are commonly used in the context of clustering time–course microarray data was done in Scharl and Leisch (2006). In this paper we want to investigate new distance measures for clustering time–course gene expression data which are robust against outlier variables. There are several algorithms which are able to deal with outlier observations. Partitioning around medoids, described in Kaufman and Rousseeuw (1990), is a more robust version of k–means for arbitrary distance measures. Trimmed k–means (Cuesta-Albertos et al., 1997) is a robust version of the original k–means algorithm. All these algorithms can handle outliers in the data points. Our goal, in contrast, is to identify outliers in the variables. We want to be robust against outliers in the time points, as technical problems like dust or a scratch on the slide can easily distort the microarray data. The Jackknife correlation, which can handle one outlier time point, was introduced by Heyer et al. (1999). Here we want to extend this promising approach to further distance measures and investigate the properties of Jackknife correlation as well as Jackknife versions of Euclidean, Manhattan and Maximum distance.

The different distance measures are compared in a simulation study using two cluster algorithms, stochastic QT-Clust (Scharl and Leisch, 2006) and the well–known k–means algorithm. For that purpose two evaluation criteria are chosen.
To investigate the stability of a method and the agreement between partitions, pairwise comparisons of cluster results are computed using the adjusted Rand Index (Hubert and Arabie, 1985). As a measure of the quality of a partition, the sum of within-cluster distances is observed.

All algorithms used are implemented in the R (http://www.r-project.org, R Development Core Team, 2006) package flexclust (Leisch, 2006), available from CRAN (http://cran.r-project.org). flexclust is a flexible toolbox for clustering which allows trying out various distance measures with only minimal programming effort.

In this simulation study a publicly available dataset from yeast was used, the seventeen time point mitotic cell cycle data (Cho et al., 1998) available at http://genome-www.stanford.edu. This dataset was preprocessed by adapting the instructions given by Heyer et al. (1999). After rescaling the data, genes that were expressed at very low levels and did not vary significantly over the time points were removed. This procedure yields gene expression data on G = 2090 genes (observations) for T = 17 time points (variables). As time point 10 was reported to be an outlier variable, the simulations were conducted on the 17 time point dataset as well as on a dataset with time point 10 removed, in order to investigate the functionality of the Jackknife distance measures.

ASA Biometrics Section 346
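
The leave-one-out idea behind the Jackknife correlation of Heyer et al. (1999) can be illustrated with a minimal Python sketch. It is not the flexclust implementation, which is written in R. It assumes, as one possible reading of that approach, that the Jackknife correlation of two expression profiles is the minimum of the Pearson correlations obtained when each time point is deleted in turn. The function names `pearson` and `jackknife_correlation` and the toy profiles are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all leave-one-time-point-out profiles.

    Assumption: the minimum is used as the aggregate, so that a single
    shared outlier time point cannot inflate the similarity of two genes.
    """
    x, y = list(x), list(y)
    return min(
        pearson(x[:t] + x[t + 1:], y[:t] + y[t + 1:])
        for t in range(len(x))
    )


# Toy profiles (hypothetical): opposite patterns except for one shared spike.
gene_a = [1, 2, 1, 2, 10, 1, 2]
gene_b = [2, 1, 2, 1, 10, 2, 1]

full = pearson(gene_a, gene_b)                # dominated by the spike
jack = jackknife_correlation(gene_a, gene_b)  # spike removed in one of the runs
```

In this toy example the ordinary correlation is high only because both profiles share one extreme time point, whereas the leave-one-out minimum reveals that the remaining time points are negatively related. A corresponding dissimilarity could be obtained as one minus the Jackknife correlation, again as an assumption rather than the definition used in this paper.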