March 22, 2006 11:20 Proceedings Trim Size: 9in x 6in WTimm

PEAK INTENSITY PREDICTION FOR PMF MASS SPECTRA USING SUPPORT VECTOR REGRESSION

W. TIMM 1,2,3, S. BÖCKER 2, T. TWELLMANN 3, T. W. NATTKEMPER 3

1. International NRW Graduate School in Bioinformatics and Genome Research
2. Junior Research Group Informatics for Mass Spectrometry, Genome Informatics Group, Faculty of Technology
3. Applied Neuroinformatics Group, Faculty of Technology
Bielefeld University, Postfach 100131, 33501 Bielefeld, Germany
E-mail: wtimm@techfak.uni-bielefeld.de

With the increasing amount of data now produced in the field of proteomics, automated approaches for reliable protein identification are highly desirable. One widely used approach relies on protein mass fingerprints (PMFs), which allow database searching for an unknown protein based on a MALDI-TOF mass spectrum of its tryptic digest. Current approaches and software packages for interpreting PMFs rarely make use of peak intensities in the measured spectrum, mostly because peak intensities in the simulated mass spectra are difficult to predict. In this work, we address the problem of predicting peak intensities in MALDI-TOF mass spectra using support vector regression (ν-SVR). We compare the impact of different preprocessing and normalization modes, such as binning and balancing data sets, on prediction accuracy. Our preliminary results indicate that peak intensities can be predicted using ν-SVR even from very small data sets. It is reasonable to assume that peak intensity prediction can greatly improve automated peptide identification.

1. Introduction

Mass spectrometry has become the method of choice to analyze the proteome of a cell. One widely used approach is based on separating proteins via two-dimensional electrophoresis, then digesting each protein using an endopeptidase such as trypsin, and finally analyzing the peptide mixture by MALDI-TOF mass spectrometry.
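To make the digestion step concrete, the following minimal sketch simulates a tryptic digest in silico and computes the singly protonated monoisotopic mass of each peptide, which is the list of peak positions a simulated PMF would contain. It is an illustration under simplifying assumptions, not the procedure used in this work: trypsin is assumed to cleave after K or R unless the next residue is P, missed cleavages and modifications are ignored, and the function names `tryptic_digest` and `peptide_mh` are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum of a free peptide
PROTON = 1.00728   # mass of the proton carried by a singly charged ion


def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    Assumption: complete digestion, i.e. no missed cleavages.
    """
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal peptide that does not end in K or R
        peptides.append(sequence[start:])
    return peptides


def peptide_mh(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    # Toy sequence chosen for illustration only.
    for pep in tryptic_digest("AKRPGKE"):
        print(pep, round(peptide_mh(pep), 4))
```

Such a simulation yields peak positions only; every predicted peptide is implicitly given the same intensity, which is the gap that peak intensity prediction is meant to close.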
Proteins are identified by comparing the resulting protein mass fingerprints (PMFs) with those in a database of known proteins. With the increasing amount of data produced in this area, automated approaches for reliable protein identification are highly desirable. Shadforth et al.¹ give an overview of currently available techniques. The most established programs for this purpose are ProFound²