A Clustering Based Hybrid System for Mass Spectrometry Data Analysis

Pengyi Yang 1 and Zili Zhang 1,2

1 Intelligent Software and Software Engineering Laboratory, Faculty of Computer and Information Science, Southwest University, Chongqing 400715, China
2 School of Engineering and Information Technology, Deakin University, Geelong, Victoria 3217, Australia
zzhang@deakin.edu.au

Abstract. Recently, much attention has been given to mass spectrometry (MS) based disease classification, diagnosis, and protein-based biomarker identification. As in microarray-based investigations, the proteomic data generated by such high-throughput experiments often have a high feature-to-sample ratio. Moreover, biological information and patterns are compounded with data noise, redundancy, and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of such data is of paramount importance. In this paper, we propose a hybrid system for analyzing such high-dimensional data. The proposed method uses a k-means clustering based feature extraction and selection procedure to bridge the filter selection and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to the k-means clustering algorithm for correlation and redundancy reduction, and a multi-objective Genetic Algorithm selector is then employed to identify discriminative m/z markers among those generated by the k-means clustering algorithm. Experimental results obtained with the proposed method indicate that it is suitable for m/z biomarker selection and MS-based sample classification.

1 Introduction

With the development of high-throughput proteomic technologies such as mass spectrometry (MS), we are now able to detect and discriminate disease patterns in complex mixtures of proteins derived from biological fluids such as serum, urine, or nipple aspirate fluid [1,2].
The technologies commonly employed in such differential studies are time-of-flight (TOF) spectroscopy with matrix-assisted or surface-enhanced laser desorption/ionization (SELDI) or SELDI-TOF [3,4]. Similar to microarray studies, SELDI-TOF datasets consist of tens of thousands of mass/charge (m/z) ratios per specimen [5,6]. Each m/z value of the spectrum approximately reflects the abundance of peptides of a certain mass [7].

M. Chetty, A. Ngom, and S. Ahmad (Eds.): PRIB 2008, LNBI 5265, pp. 98–109, 2008. © Springer-Verlag Berlin Heidelberg 2008

Despite its great promise, the analysis of the data generated by such studies presents several major challenges. The challenges originate from the nature that