Probability-Based Validation of Protein Identifications Using a Modified SEQUEST Algorithm
Michael J. MacCoss,†,‡ Christine C. Wu,†,‡ and John R. Yates, III*,†,§
Department of Cell Biology, The Scripps Research Institute, La Jolla, California 92037, and Department of Proteomics and
Metabolomics, Torrey Mesa Research Institute, 3115 Merryfield Row, San Diego, California 92121-1125
Database-searching algorithms compatible with shotgun
proteomics match a peptide tandem mass spectrum to a
predicted mass spectrum for an amino acid sequence
within a database. SEQUEST is one of the most common
software algorithms for the analysis of peptide tandem
mass spectra; it uses a cross-correlation (XCorr) scoring
routine to match tandem mass spectra to model spectra
derived from peptide sequences. To assess a match,
SEQUEST uses the difference between the scores of the
first- and second-ranked sequences (ΔCn). This value is
dependent on the database size, search parameters, and
sequence homologies. In this report, we demonstrate the
use of a scoring routine (SEQUEST-NORM) that normalizes
XCorr values to be independent of peptide size and of
the database used to perform the search. This new scoring
routine is used to objectively calculate the percent
confidence of protein identifications and posttranslational
modifications based solely on the XCorr value.
Scientists of the postgenomic era are quickly becoming
advocates of global analyses of biological systems. This excitement
has been ignited by the success of microarray technology, and
current trends in the field of proteomics are shifting from
traditional two-dimensional gel electrophoresis (2D-GE)-based
technology to more efficient global analysis methods. Shotgun
proteomics has emerged as a promising strategy because it
provides robust and sensitive methods to profile the protein
complement within a complex biological sample (ref 1). This approach
uses multidimensional protein identification technology (MudPIT),
which incorporates the resolving power of multidimensional
high-pressure liquid chromatography (LC/LC), the sequence analysis
capabilities of tandem mass spectrometry, and database-searching
software (refs 1, 2).
A sample for MudPIT analysis is prepared by digesting a
complex protein mixture with proteases to produce an even more
complex peptide mixture. The peptides are loaded directly onto
an LC/LC column placed in-line with a tandem mass spectrometer.
Tandem mass spectra are acquired “on the fly” as peptides are
eluted from the column, ionized, and emitted into the mass
spectrometer (refs 1, 2). The corresponding peptide sequences are
identified using software that correlates experimentally acquired
tandem mass spectra against theoretical spectra predicted from
amino acid sequences contained within a sequence database (ref 3).
Peptides are then “assembled” back into proteins using a second
software algorithm (ref 4). This methodology is becoming increasingly
routine, user-friendly, and applicable to almost any biological
system. However, although data can be generated and searched very
rapidly, assessing the quality of these results in an objective
manner is often the analytical bottleneck, especially when the
presence of a protein is inferred from a single tandem mass
spectral match to a sequence.
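As a rough illustration of the correlation step described above, the following Python sketch scores a binned observed spectrum against two candidate model spectra and then computes ΔCn from the ranked scores. It is a simplified stand-in, not the SEQUEST implementation. The function names, the unit-width binning, the exclusion of the zero lag from the background term, and the toy spectra are all assumptions made for illustration. The score is taken as the zero-lag correlation minus the mean correlation over a window of nearby lags, and ΔCn is assumed to have the commonly quoted normalized form (best - second) / best.

```python
def cross_correlation(observed, theoretical, lag):
    """Dot product of the observed spectrum with the model shifted by `lag` bins."""
    n = len(observed)
    total = 0.0
    for i in range(n):
        j = i + lag
        if 0 <= j < n:
            total += observed[i] * theoretical[j]
    return total


def xcorr_score(observed, theoretical, max_lag=75):
    """Zero-lag correlation minus the mean correlation over nearby lags.

    Assumption: the zero lag is left out of the background average.
    """
    if len(observed) != len(theoretical):
        raise ValueError("spectra must be binned onto the same axis")
    zero_lag = cross_correlation(observed, theoretical, 0)
    lags = [lag for lag in range(-max_lag, max_lag + 1) if lag != 0]
    background = sum(
        cross_correlation(observed, theoretical, lag) for lag in lags
    ) / len(lags)
    return zero_lag - background


def delta_cn(scores):
    """Normalized gap between the best and second-best candidate scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return (ranked[0] - ranked[1]) / ranked[0]


def make_spectrum(peaks, size=20):
    """Build a binned spectrum from a {bin index: intensity} mapping."""
    spectrum = [0.0] * size
    for index, intensity in peaks.items():
        spectrum[index] = intensity
    return spectrum


# Hypothetical toy data: candidate A shares all three peaks with the
# observed spectrum, candidate B shares only one.
observed = make_spectrum({3: 1.0, 7: 0.5, 12: 1.0})
candidate_a = make_spectrum({3: 1.0, 7: 1.0, 12: 1.0})
candidate_b = make_spectrum({3: 1.0, 8: 1.0, 15: 1.0})

# A small lag window is used because the toy spectra have only 20 bins.
score_a = xcorr_score(observed, candidate_a, max_lag=5)
score_b = xcorr_score(observed, candidate_b, max_lag=5)
print(score_a, score_b, delta_cn([score_a, score_b]))
```

In this toy case the fully matching candidate ranks first, and ΔCn expresses how far it stands above the runner-up relative to its own score. Because ΔCn depends on which runner-up happens to be in the candidate list, it shifts with the database and search parameters.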
Proteomic analyses are dependent on software that searches
a sequence database using data derived from mass spectrometry.
Current database-searching algorithms are usually divided into
one of two categories: (1) software that matches peptide masses
to predicted protein peptide maps (ref 5) or (2) software that
matches a peptide tandem mass spectrum to a predicted mass spectrum
for an amino acid sequence within a database (ref 3). The first category
generally assumes that the peptide masses are derived from
samples composed of only a few components using a protease
with selected amino acid specificity and, thus, is not compatible
with complex protein mixtures. In contrast, the second category
is ideally suited to protein mixtures because the data analysis is
performed at the single-peptide level, and the identified peptides
are assembled back into proteins afterward.
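To make the first category concrete, the Python sketch below digests a protein sequence in silico and counts how many observed peptide masses fall within a tolerance of a predicted peptide mass. It is a minimal illustration under stated assumptions, not any published tool: the function names, the example sequence, and the tolerance are hypothetical, the cleavage rule is the usual trypsin convention (cleave C-terminal to K or R unless the next residue is P), missed cleavages are ignored, and the masses are standard monoisotopic residue masses for the neutral peptide.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R unless followed by P (no missed cleavages)."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide: residues plus one water."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER_MASS


def count_mass_matches(observed_masses, protein, tolerance=0.05):
    """Count observed masses that match any predicted peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        1
        for mass in observed_masses
        if any(abs(mass - candidate) <= tolerance for candidate in predicted)
    )


# Hypothetical example sequence and observed masses.
example_protein = "AGKRPLMKTY"
print(tryptic_digest(example_protein))
print(count_mass_matches([274.16, 500.0], example_protein))
```

The sketch depends on knowing the cleavage rule in advance and on the observed masses coming from only a few proteins. Both conditions fail for complex mixtures, which is why this category does not extend to them.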
The identification of proteins by MudPIT is complicated
because the database correlation software must be capable of
handling tandem mass spectra from peptides produced without
any cleavage specificity. Even if a highly specific enzyme (e.g.,
trypsin) is used during the sample preparation, the presence of
endogenous proteases within the experimental sample results in
a significant amount of nonspecific cleavage. Recently, nonspecific
proteases have been exploited to produce overlapping peptides
* To whom correspondence should be addressed. E-mail: jyates@scripps.edu. Voice: (858) 784-8862. Fax: (858) 784-8883.
† The Scripps Research Institute.
‡ These authors contributed equally to this work.
§ Torrey Mesa Research Institute.
(1) Link, A. J.; Eng, J. K.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-82.
(2) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-7.
(3) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-89.
(4) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III. J. Proteome Res. 2002, 1, 21-6.
(5) Yates, J. R., III; Speicher, S.; Griffin, P. R.; Hunkapiller, T. Anal. Biochem. 1993, 214, 397-408.
10.1021/ac025826t CCC: $22.00 © xxxx American Chemical Society Analytical Chemistry A