Predicting Molecular Formulas of Fragment Ions with Isotope Patterns in Tandem Mass Spectra
Jingfen Zhang, Wen Gao, Jinjin Cai, Simin He, Rong Zeng, and Runsheng Chen
Abstract—A number of approaches have been proposed to predict the elemental component formulas (or molecular formulas) of molecular ions in low- and medium-resolution mass spectra. Most of them rely on isotope patterns, enumerate all possible formulas for an ion, and exclude formulas that violate chemical constraints. However, these methods do not generalize well to the component prediction of fragment ions in tandem mass spectra. In this paper, a new method, FFP (Fragment ion Formula Prediction), is presented to predict the elemental component formulas of fragment ions. In the FFP method, the prediction of the best formulas is converted into the minimization of the distance between theoretical and observed isotope patterns. A novel local search model is then proposed to generate a set of candidate formulas efficiently. After the search, FFP applies a new multiconstraint filter to exclude as many invalid and improbable formulas as possible. FFP is compared experimentally with the previous enumeration methods and is shown to outperform them significantly. The results of this paper can help to improve the reliability of de novo identification of peptide sequences.
Index Terms—Isotope patterns, peptide sequencing, tandem mass spectra.
1 INTRODUCTION
Tandem mass spectrometry (MS/MS) is an essential and reliable tool for biologists to identify peptide and protein sequences. Many approaches have been designed for peptide sequencing [1], [2], [3], [4], [5], [6], [7]. One important class of these approaches is de novo sequencing, which derives a (partial) sequence directly from an experimental spectrum [4], [5], [6], [7]. Because of the measurement errors in medium-resolution spectrometry, a massive number of candidates is produced, which results in identification confusion.
The isotope patterns of ions obtained in a tandem spectrum can be used to help resolve this confusion. In a spectrum, a peak corresponds to an ion with an elemental component formula (i.e., a molecular formula). As will be shown later, the observed isotope patterns can help to predict the elemental component formulas of ions accurately, which, in turn, can help to improve the reliability of the de novo method. In this paper, we focus on predicting the elemental component formulas of ions from isotope patterns.
Several methods and programs, e.g., [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], have been developed to predict the elemental component formulas of molecular ions using isotope patterns. In simple cases, the maximal and minimal numbers of each element are estimated from the intensities of the isotope peaks. Then, constraints on rings and double bonds are used to rule out certain improbable formulas [8], [9], [10], [11]. In more sophisticated cases [12], [13], [14], [15], [16], [17], the most probable formulas are predicted by comparing the theoretical isotope patterns with the experimental ones. More specifically, these methods deal with molecular ions containing the elements C, H, N, O, S, Si, O, Cl, B, Br, and so on. They usually include three processes: 1) candidate generation, which exhaustively enumerates all possible elemental component formulas corresponding to a given mass and mass error tolerance, 2) filtering, which excludes formulas violating chemical constraints, and 3) matching, which compares the calculated theoretical isotope pattern of each remaining formula with the observed one. The best match is regarded as the most probable formula.
These methods [12], [13], [14], [15], [16], [17] are capable of providing adequate solutions in some cases. However, they do not generalize well to the prediction of component formulas of peptide fragment ions, for the following reasons.
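The three processes described above (candidate generation, filtering, matching) can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited programs. All function names are hypothetical, the element set is restricted to C, H, N, O, and S, the isotope abundances are rounded approximate values, isotope patterns are computed at nominal-mass resolution only, the filter is a simple ring-and-double-bond rule for neutral molecules, and the Euclidean distance is an assumed matching score.

```python
# Monoisotopic masses (u) and approximate natural isotope abundances,
# indexed by the number of extra neutrons (M, M+1, M+2, ...). Rounded values.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915, "S": 31.972071}
ISO = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}
ELEMENTS = "CHNOS"


def enumerate_formulas(mass, tol):
    """Process 1: exhaustively list element counts with |mass error| <= tol."""
    results = []

    def recurse(i, remaining, counts):
        if i == len(ELEMENTS):
            if abs(remaining) <= tol:
                results.append(dict(zip(ELEMENTS, counts)))
            return
        m = MONO[ELEMENTS[i]]
        # Try every count of this element that does not overshoot the mass.
        for n in range(int((remaining + tol) // m) + 1):
            recurse(i + 1, remaining - n * m, counts + [n])

    recurse(0, mass, [])
    return results


def passes_filter(f):
    """Process 2 (assumed rule): rings plus double bonds of a neutral
    molecule, (2C + 2 + N - H) / 2, must be a non-negative integer."""
    twice_rdbe = 2 * f["C"] + 2 + f["N"] - f["H"]
    return twice_rdbe >= 0 and twice_rdbe % 2 == 0


def convolve(a, b, n_peaks):
    """Combine two isotope distributions, keeping the first n_peaks peaks."""
    out = [0.0] * n_peaks
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < n_peaks:
                out[i + j] += x * y
    return out


def isotope_pattern(formula, n_peaks=3):
    """Theoretical pattern (M, M+1, ...), scaled so the tallest peak is 1."""
    pattern = [1.0] + [0.0] * (n_peaks - 1)
    for element, count in formula.items():
        for _ in range(count):
            pattern = convolve(pattern, ISO[element], n_peaks)
    top = max(pattern)
    return [p / top for p in pattern]


def distance(p, q):
    """Euclidean distance between two patterns of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def rank_formulas(mass, tol, observed):
    """Process 3: order the surviving candidates by pattern distance."""
    candidates = [f for f in enumerate_formulas(mass, tol) if passes_filter(f)]
    return sorted(
        candidates,
        key=lambda f: distance(isotope_pattern(f, len(observed)), observed),
    )


if __name__ == "__main__":
    # Glycine, C2H5NO2, has a monoisotopic mass of about 75.032 u.
    glycine = {"C": 2, "H": 5, "N": 1, "O": 2, "S": 0}
    observed = isotope_pattern(glycine)  # stand-in for a measured pattern
    print(len(enumerate_formulas(75.032, 0.5)), "candidates before filtering")
    print("best match:", rank_formulas(75.032, 0.5, observed)[0])
```

Running the sketch with larger target masses or wider tolerances shows how quickly the candidate list grows, which is the cost of exhaustive enumeration.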
First, the fragment ions in MS/MS spectra are more complex than molecular ions (e.g., fragment ions no longer obey general ring and double-bond constraints). Second, the number of possible formulas increases exponentially with the mass of the ion. Because it is difficult to differentiate among the massive number of competing formulas, the reliability of identification and prediction decreases dramatically. Furthermore, the computing time needed for the exhaustive enumeration and comparison of elemental component formulas becomes prohibitively high.
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 3, JULY-SEPTEMBER 2005, p. 217
J. Zhang, W. Gao, J. Cai, and S. He are with the Institute of Computing Technology, Chinese Academy of Sciences, JDL, Room 701, Power Creative A, No. 1, Shangdi East Road, Haidian District, Beijing, 100080, P.R. China. E-mail: {jfzhang, wgao, jjcai, smhe}@jdl.ac.cn.
R. Zeng is with the Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Science, Chinese Academy of Sciences, 320 Yue Yang Road, Shanghai, 200031, P.R. China. E-mail: zr@sibs.ac.cn.
R. Chen is with the Institute of Biophysics, Chinese Academy of Sciences, 15 Datun Road, Chaoyang District, Beijing, 100101, P.R. China. E-mail: crs@ict.ac.cn.
Manuscript received 8 Sept. 2004; revised 4 Dec. 2004; accepted 13 Dec. 2004; published online 31 Aug. 2005. For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0139-0904.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.
Therefore, these methods are not