A MULTIRESOLUTION ENHANCEMENT TO GENERIC CLASSIFIERS OF SUBCELLULAR PROTEIN LOCATION IMAGES

Thomas Merryman¹, Keridon Williams³, Gowri Srinivasa², Amina Chebira² and Jelena Kovačević²,¹

¹Electrical and Computer Engineering and ²Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA
³Dept. of Science and Mathematics, University of Virgin Islands, St Thomas, VI

ABSTRACT

We propose an algorithm for the classification of fluorescence microscopy images depicting the spatial distribution of proteins within the cell. The problem is at the forefront of the current trend in biology towards understanding the role and function of all proteins. The importance of protein subcellular location was pointed out by Murphy, whose group produced the first automated system for classification of images depicting these locations, based on diverse feature sets and combinations of classifiers. With the addition of the simplest multiresolution features, the same group obtained the highest reported accuracy of 91.5% for the denoised 2D HeLa data set. Here, we aim to improve upon that system by adding the true power of multiresolution: adaptivity. In the process, we build a system able to work with any feature sets and any classifiers, which we denote as a Generic Classification System (GCS). Our system consists of multiresolution (MR) decomposition in the front, followed by feature computation and classification in each subband, yielding local decisions. This is followed by the crucial step of combining all those local decisions into a global one, while at the same time ensuring that the resulting system does no worse than a no-decomposition one.
On a nondenoised data set, with a much smaller number of features (a combination of texture and Zernike moment features) and a neural network classifier, we obtain a high accuracy of 89.8%, effectively proving that the space-frequency localized information in the subbands adds to the discriminative power of the system.

1. LOCATION PROTEOMICS AND MULTIRESOLUTION ANALYSIS

Among the most important goals in biological sciences today is to understand the role and function of all proteins. One of the critical aspects of a protein's activity and function is its subcellular location (PSL), that is, the spatial distribution of proteins within the cell. Today's method of choice to determine PSL is fluorescence microscopy, whose success is due in part to the advent of a range of new fluorescent probes used to tag proteins or molecules of interest, including the nontoxic green fluorescent protein (GFP). Once successfully tagged, the cells are imaged using one of many models of fluorescence microscopes to produce a multidimensional data set, a bioimage. These bioimages can be just 2D slices or 3D cell/tissue volumes (z-stacks). Acquiring bioimages at multiple time instants results in 3D movies (2D time series) or 4D data sets (z-stacks tracked in time). Finally, these microscopes allow for imaging of multiple fluorescence channels, bringing the possible dimensionality of the data set to 5D.

This work was supported in part by the PA State Tobacco Settlement, Kamlet-Smith Bioinformatics Grant.

Fig. 1. MR enhancement to a generic classification system. Blocks, in order: Input Image, MR Decomposition, Feature Computation, Classification, Voting, Class Label.
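The block structure of Fig. 1 can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: a one-level Haar split stands in for the MR decomposition, three summary statistics stand in for the texture and moment features, a nearest-centroid rule stands in for the neural network classifier, and an unweighted majority vote stands in for the combination of local decisions. It does not attempt to enforce the guarantee of doing no worse than a no-decomposition system.

```python
import numpy as np

def haar_subbands(img):
    """One-level 2D Haar split: approximation plus three detail subbands.
    Assumed stand-in for the MR decomposition; image sides must be even."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    return [(a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2]

def features(band):
    """Toy feature vector (mean, standard deviation, mean magnitude).
    Hypothetical placeholder for texture or moment features."""
    return np.array([band.mean(), band.std(), np.abs(band).mean()])

class NearestCentroid:
    """Toy classifier; hypothetical placeholder for any trainable classifier."""
    def fit(self, X, y):
        self.labels = np.unique(y)
        self.centroids = np.array([X[y == k].mean(axis=0) for k in self.labels])
        return self

    def predict(self, x):
        distances = np.linalg.norm(self.centroids - x, axis=1)
        return self.labels[np.argmin(distances)]

def train(images, labels):
    """Train one classifier per subband."""
    labels = np.asarray(labels)
    # Regroup so that each entry holds the same subband across all images.
    per_band = zip(*(haar_subbands(im) for im in images))
    return [NearestCentroid().fit(np.array([features(b) for b in bands]), labels)
            for bands in per_band]

def classify(img, classifiers):
    """Local decision in each subband, then an unweighted majority vote.
    Ties go to the smallest label, since np.argmax returns the first maximum."""
    local = [clf.predict(features(band))
             for clf, band in zip(classifiers, haar_subbands(img))]
    values, counts = np.unique(local, return_counts=True)
    return values[np.argmax(counts)], local
```

Calling `train(images, labels)` on a list of equally sized 2D float arrays returns one classifier per subband, and `classify(img, classifiers)` returns the voted label together with the list of local decisions, so the contribution of each subband can be inspected.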
With the enormous volume of such high-dimensional data sets being generated, human analysis becomes time-consuming, prone to error and, ultimately, impractical, leading to the "holy grail" for PSL bioimage interpretation and analysis: develop a system for fast, automatic, and accurate recognition of proteins based on their subcellular location images.

Murphy et al. pioneered automated PSL interpretation and analysis, resulting in a system that can classify protein location patterns with well-characterized reliability and better sensitivity than human observers [1, 2]. This work was followed by [3, 4]. With the addition of the simplest multiresolution features, in [1], the authors obtained the highest reported accuracy of 91.5% for the denoised 2D HeLa data set.

Problem Statement. The problem we are addressing is that of classifying the spatial distribution patterns of selected proteins within the cell. The challenge in this data set is that images from the same class look different while those from different classes look very similar (see Fig. 2). In the data sets we use, the proteins were labeled using immunofluorescence. We therefore know the ground truth, that is, which proteins were labeled and subsequently imaged. This is useful for algorithm development, as we can test the accuracy of our classification scheme.

Methodology. As the introduction of the simplest multiresolution features produced a statistically significant jump in classification accuracy, our aim is to explore more sophisticated multiresolution techniques. In particular, we wish to explore the following three characteristics of multiresolution:

(a) Localization: Fluorescence microscopy images have highly localized features both in space and frequency. This leads us to MR tools, as they have been found to be the most appropriate tools for computing and isolating such localized features [5].
(b) Adaptivity: Given that we are designing a system to distinguish between classes of proteins, it is clear that an ideal solution is adaptive, a property provided by MR techniques.

(c) Fast and Efficient Computation: It is well known that the discrete wavelet transform has a computational cost of order O(n), where n is the input size, as opposed to the O(n log n) typical of other fast linear transforms, including the FFT.

0-7803-9577-8/06/$20.00 ©2006 IEEE ISBI 2006

Our Philosophy. Based on the above arguments, we would like to extract discriminative features within space-frequency localized subspaces. These are obtained by MR decomposition; thus, instead