Probability-Based Validation of Protein Identifications Using a Modified SEQUEST Algorithm

Michael J. MacCoss,†,‡ Christine C. Wu,†,‡ and John R. Yates, III*,†,§

Department of Cell Biology, The Scripps Research Institute, La Jolla, California 92037, and Department of Proteomics and Metabolomics, Torrey Mesa Research Institute, 3115 Merryfield Row, San Diego, California 92121-1125

Database-searching algorithms compatible with shotgun proteomics match a peptide tandem mass spectrum to a predicted mass spectrum for an amino acid sequence within a database. SEQUEST, one of the most common software algorithms for analyzing peptide tandem mass spectra, uses a cross-correlation (XCorr) scoring routine to match tandem mass spectra to model spectra derived from peptide sequences. To assess a match, SEQUEST uses the difference between the scores of the first- and second-ranked sequences (ΔCn). This value is dependent on the database size, search parameters, and sequence homologies. In this report, we demonstrate the use of a scoring routine (SEQUEST-NORM) that normalizes XCorr values to be independent of peptide size and the database used to perform the search. This new scoring routine is used to objectively calculate the percent confidence of protein identifications and posttranslational modifications based solely on the XCorr value.

Scientists of the postgenomic era are quickly becoming advocates of global analyses of biological systems. This excitement has been ignited by the success of microarray technology, and current trends in the field of proteomics are shifting from traditional two-dimensional gel electrophoresis (2D-GE)-based technology to more efficient global analysis methods. Shotgun proteomics has emerged as a promising strategy because it provides robust and sensitive methods to profile the protein complement within a complex biological sample.
1 This approach uses multidimensional protein identification technology (MudPIT), which incorporates the resolving power of multidimensional high-pressure liquid chromatography (LC/LC), the sequence analysis capabilities of tandem mass spectrometry, and database-searching software. 1,2 A sample for MudPIT analysis is prepared by digesting a complex protein mixture with proteases to produce an even more complex peptide mixture. The peptides are loaded directly onto an LC/LC column placed in-line with a tandem mass spectrometer. Tandem mass spectra are acquired "on the fly" as peptides are eluted from the column, ionized, and emitted into the mass spectrometer. 1,2 The corresponding peptide sequences are identified using software that correlates experimentally acquired tandem mass spectra against theoretical spectra predicted from amino acid sequences contained within a sequence database. 3 Peptides are then "assembled" back into proteins using a second software algorithm. 4 This methodology is becoming increasingly routine, user-friendly, and applicable to almost any biological system. However, although data can be generated and searched very rapidly, assessing the quality of these results objectively is often the analytical bottleneck, especially when the presence of a protein is inferred from a single tandem mass spectral match to a sequence.

Proteomic analyses are dependent on software that searches a sequence database using data derived from mass spectrometry. Current database-searching algorithms are usually divided into one of two categories: (1) software that matches peptide masses to predicted protein peptide maps 5 or (2) software that matches a peptide tandem mass spectrum to a predicted mass spectrum for an amino acid sequence within a database.
3 The first category generally assumes that the peptide masses are derived from samples composed of only a few components and digested with a protease of selected amino acid specificity; thus, it is not compatible with complex protein mixtures. In contrast, the second category is ideally suited to protein mixtures because the data analysis is performed at the single-peptide level, and the peptides are assembled back into proteins only after their sequences have been identified.

The identification of proteins by MudPIT is complicated because the database correlation software must be capable of handling tandem mass spectra from peptides produced without any cleavage specificity. Even if a highly specific enzyme (e.g., trypsin) is used during the sample preparation, the presence of endogenous proteases within the experimental sample results in a significant amount of nonspecific cleavage. Recently, nonspecific proteases have been exploited to produce overlapping peptides

* To whom correspondence should be addressed. E-mail: jyates@scripps.edu. Voice: (858) 784-8862. Fax: (858) 784-8883.
† The Scripps Research Institute.
‡ These authors contributed equally to this work.
§ Torrey Mesa Research Institute.

(1) Link, A. J.; Eng, J. K.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-82.
(2) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-7.
(3) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-89.
(4) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III. J. Proteome Res. 2002, 1, 21-6.
(5) Yates, J. R., III; Speicher, S.; Griffin, P. R.; Hunkapiller, T. Anal. Biochem. 1993, 214, 397-408.

10.1021/ac025826t © xxxx American Chemical Society. Analytical Chemistry.
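The contrast drawn above between enzyme-specific and nonspecific cleavage can be illustrated with a minimal sketch. It is not part of SEQUEST or of the authors' method. The function names, the length limits, and the toy sequence are hypothetical, and trypsin is modeled with the common simplification "cleave C-terminal to K or R, except before P" with no missed cleavages.

```python
def tryptic_peptides(protein, min_len=6, max_len=30):
    """Fully tryptic peptides (assumed rule: cut after K/R, not before P)."""
    peptides, start = set(), 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        cleaves = residue in "KR" and not at_end and protein[i + 1] != "P"
        if cleaves or at_end:
            piece = protein[start:i + 1]
            if min_len <= len(piece) <= max_len:
                peptides.add(piece)
            start = i + 1
    return peptides


def nonspecific_peptides(protein, min_len=6, max_len=30):
    """Every contiguous subsequence within the length limits (no enzyme rule)."""
    n = len(protein)
    return {
        protein[i:j]
        for i in range(n)
        for j in range(i + min_len, min(i + max_len, n) + 1)
    }


if __name__ == "__main__":
    toy = "AAKGGRPLLKDDR"  # hypothetical sequence, for illustration only
    specific = tryptic_peptides(toy, min_len=2)
    unrestricted = nonspecific_peptides(toy, min_len=2)
    print("tryptic candidates:    ", len(specific))
    print("nonspecific candidates:", len(unrestricted))
```

Even for a single short sequence, the unrestricted candidate set is far larger than the tryptic one and contains it as a subset, which suggests why a search without cleavage specificity has many more candidate sequences to score against each spectrum.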