Performance Evaluation of Existing De Novo Sequencing Algorithms

Sergey Pevtsov,†,§ Irina Fedulova,†,§ Hamid Mirzaei, Charles Buck, and Xiang Zhang*,‡

Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Moscow 119992, Russian Federation, and Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907

Received May 9, 2006

Two methods have been developed for protein identification from tandem mass spectra: database searching and de novo sequencing. De novo sequencing identifies peptides directly from tandem mass spectra. Among the many proposed algorithms, we evaluated the performance of five de novo sequencing algorithms: AUDENS, Lutefisk, NovoHMM, PepNovo, and PEAKS. Our evaluation methods are based on the calculation of relative sequence distance (RSD), algorithm sensitivity, and spectrum quality. We found that the de novo sequencing algorithms perform differently on QSTAR and LCQ mass spectrometer data but, in general, perform better on QSTAR data than on LCQ data. For the QSTAR data, the performance order of the five algorithms is PEAKS > Lutefisk, PepNovo > AUDENS, NovoHMM. The performance of PEAKS, Lutefisk, and PepNovo depends strongly on spectrum quality and improves as spectrum quality increases. AUDENS and NovoHMM, however, are not sensitive to spectrum quality. Compared with the other four algorithms, PEAKS has the best sensitivity and the best performance over the entire range of spectrum quality. For the LCQ data, the performance order is NovoHMM > PepNovo, PEAKS > Lutefisk > AUDENS. NovoHMM has the best sensitivity, and its performance is the best over the entire range of spectrum quality. However, the overall performance of NovoHMM is not significantly different from that of PEAKS and PepNovo. AUDENS does not perform well on either QSTAR or LCQ data.
Keywords: mass spectrometry; peptide identification; de novo sequencing; mass spectral quality

Introduction

Protein identification is a key step for proteomics to contribute to the understanding of biological systems. Two approaches are generally deployed to interpret experimental MS/MS data: database searching and de novo sequencing. In database searching, which is the most widely employed, protein databases are used to find a peptide whose theoretically predicted spectrum best matches the experimental data. Different scoring schemes are used to evaluate the matches between candidate peptides and the given experimental spectrum. [1-9]

The determination of peptide sequences without the help of a protein database is called de novo sequencing. Identification of new proteins, proteins resulting from mutations, and proteins with previously undescribed chemical modifications (unexpected modifications) all require de novo sequencing. Many de novo sequencing algorithms based on different approaches have been developed; we briefly describe some of them here.

The exhaustive listing algorithm first finds all possible candidate sequences according to the mass of the parent ion. [10,11] All generated candidate sequences are then compared with the experimental spectrum to determine the best match. Given the inexact determination of the parent mass, the main difficulty of this approach is the exponential growth in the number of possible candidate sequences as the parent ion mass increases. As mass measurement accuracy improves, the composition-based sequencing approach becomes very promising. For example, there are only about 100 possible peptide candidates with a mass of 1347 Da when the mass accuracy of the mass spectrometer is better than ±1 ppm. [12]

Another approach examines only small parts of the sequence and then extends each short sequence tag by adding amino acids at both ends until the full mass is accumulated.
[13-16] A disadvantage of this approach is that good candidate peptides may be discarded during tag extension because some ions are absent from the tandem spectrum.

The main idea of the graph-theoretical approach is to represent the spectrum with a "spectrum graph". [17] Each peak in the spectrum corresponds to a vertex in the graph, and two vertices are connected by an edge if the mass difference between them equals the mass of one or several amino acids. Vertices corresponding to the N- and C-termini are usually added to the graph. Finding the path from the N- to the C-terminus provides a method to generate candidate sequences which are

* To whom correspondence should be addressed. Xiang Zhang, Bindley Bioscience Center, Purdue University, West Lafayette, IN 47906. Tel: (765) 496-1153. E-mail: zhang100@purdue.edu.
† Lomonosov Moscow State University.
§ These authors made equal contributions to this work.
‡ Purdue University.

Journal of Proteome Research 2006, 5, 3018-3028. 10.1021/pr060222h CCC: $33.50 © 2006 American Chemical Society. Published on Web 10/06/2006
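The spectrum-graph idea described above can be illustrated with a minimal Python sketch. It is not the implementation used by any of the evaluated programs. It rests on several simplifying assumptions that are not in the text: peaks are assumed to be already converted to prefix residue masses (sums of residue masses counted from the N-terminus), only single-residue edges are created (the text also allows edges spanning several amino acids), only a small subset of amino acids is included, and the mass tolerance of 0.01 Da is arbitrary. The function names `build_spectrum_graph` and `sequence_candidates` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Monoisotopic residue masses (Da) for a small illustrative subset of amino acids.
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406,  # L and I are isobaric
}

Edges = Dict[int, List[Tuple[int, str]]]


def build_spectrum_graph(prefix_masses: Sequence[float],
                         parent_residue_mass: float,
                         tol: float = 0.01) -> Tuple[List[float], Edges]:
    """Build vertices (masses) and edges labeled with one amino acid each."""
    # Keep only peaks strictly between the two terminal vertices.
    inner = [m for m in prefix_masses if 0.0 < m < parent_residue_mass]
    # Vertex 0 is the N-terminus (mass 0); the last vertex is the C-terminus.
    vertices = sorted({0.0, parent_residue_mass, *inner})
    edges: Edges = {i: [] for i in range(len(vertices))}
    for i in range(len(vertices)):
        for j in range(i + 1, len(vertices)):
            diff = vertices[j] - vertices[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))
    return vertices, edges


def sequence_candidates(prefix_masses: Sequence[float],
                        parent_residue_mass: float,
                        tol: float = 0.01) -> List[str]:
    """Enumerate every N-terminus to C-terminus path as a candidate sequence."""
    vertices, edges = build_spectrum_graph(prefix_masses, parent_residue_mass, tol)
    target = len(vertices) - 1
    results: List[str] = []

    def walk(node: int, prefix: str) -> None:
        if node == target:
            results.append(prefix)
            return
        for nxt, aa in edges[node]:
            walk(nxt, prefix + aa)

    walk(0, "")
    return results


# Prefix masses of the peptide A-G-L plus one unrelated noise peak at 150.0 Da.
peaks = [71.03711, 128.05857, 150.0]
print(sequence_candidates(peaks, 241.14263))  # prints ['AGL']
```

The noise peak forms no edges because its distance to every other vertex does not match a residue mass, so it never appears on a complete path. When fragment ions are missing, no single-residue path may exist, which is why edges for combinations of several amino acids are needed in practice.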