The Paragon Algorithm, a Next Generation
Search Engine That Uses Sequence
Temperature Values and Feature Probabilities
to Identify Peptides from Tandem Mass
Spectra*
Ignat V. Shilov‡, Sean L. Seymour‡§, Alpesh A. Patel, Alex Loboda, Wilfred H. Tang,
Sean P. Keating¶, Christie L. Hunter, Lydia M. Nuwaysir, and Daniel A. Schaeffer
The Paragon™ Algorithm, a novel database search engine
for the identification of peptides from tandem mass
spectrometry data, is presented. Sequence Temperature
Values are computed using a sequence tag algorithm,
allowing the degree of implication by an MS/MS spectrum
of each region of a database to be determined on a
continuum. Counter to conventional approaches, features
such as modifications, substitutions, and cleavage events
are modeled with probabilities rather than by discrete
user-controlled settings to consider or not consider a
feature. The use of feature probabilities in conjunction
with Sequence Temperature Values allows for a very large
increase in the effective search space with only a very
small increase in the actual number of hypotheses that
must be scored. The algorithm has a new kind of user
interface that removes the user expertise requirement,
presenting control settings in the language of the
laboratory that are translated to optimal algorithmic
settings. To validate this new algorithm, a comparison
with Mascot is presented for a series of analogous
searches to explore the relative impact of increasing
search space probed with Mascot by relaxing the tryptic
digestion conformance requirements from trypsin to
semitrypsin to no enzyme and with the Paragon Algorithm
using its Rapid mode and Thorough mode with and without
tryptic specificity. Although they performed similarly
for small search space, dramatic differences were
observed in large search space. With the Paragon
Algorithm, hundreds of biological and artifact
modifications, all possible substitutions, and all levels
of conformance to the expected digestion pattern can be
searched in a single search step, yet the typical cost in
search time is only 2–5 times that of conventional small
search space. Despite this large increase in effective
search space, there is no drastic loss of discrimination
that typically accompanies the exploration of large
search space. Molecular & Cellular Proteomics
6:1638–1655, 2007.
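To make the trypsin, semitrypsin, and no-enzyme comparison concrete, the following minimal sketch counts how many candidate peptides a single protein sequence yields under each level of digestion conformance. It is an illustration only, not the method used by either Mascot or the Paragon Algorithm. The function names, the length limits, the example sequence, and the simplified cleavage rule (cut after K or R unless the next residue is P) are all assumptions made for this example, and missed cleavages are limited only by peptide length, so the counts are upper bounds.

```python
def tryptic_sites(seq):
    """Cut positions under a simplified trypsin rule (assumption):
    cleave after K or R unless the next residue is P.
    The sequence start (0) and end (len(seq)) are always included."""
    sites = [0]
    for i in range(1, len(seq)):
        if seq[i - 1] in "KR" and seq[i] != "P":
            sites.append(i)
    sites.append(len(seq))
    return sites


def count_candidates(seq, min_len=6, max_len=30):
    """Count candidate peptides (substrings) of seq by terminus conformance.

    trypsin:     both termini fall on tryptic cut positions
    semitrypsin: at least one terminus falls on a tryptic cut position
    no enzyme:   every substring within the length limits
    """
    sites = set(tryptic_sites(seq))
    counts = {"trypsin": 0, "semitrypsin": 0, "no enzyme": 0}
    n = len(seq)
    for start in range(n):
        last_end = min(start + max_len, n)
        for end in range(start + min_len, last_end + 1):
            conforming = (start in sites) + (end in sites)
            counts["no enzyme"] += 1
            if conforming >= 1:
                counts["semitrypsin"] += 1
            if conforming == 2:
                counts["trypsin"] += 1
    return counts


# Hypothetical sequence, chosen only to exercise the functions.
example = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVK"
print(count_candidates(example))
```

Even for one short sequence, the number of candidates grows sharply as the conformance requirement is relaxed, which is why searching without enzyme specificity is conventionally costly.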
This study presents a new software technology for the
identification of peptides from tandem mass spectra called
the Paragon™ Algorithm, hereafter referred to
interchangeably as “Paragon.” The most common application
for this class of software tools is so-called “shotgun” or
“bottom-up” proteomics experiments (1) where a protein
mixture of any complexity is digested with a proteolytic
enzyme or reagent, the peptides are analyzed by tandem mass
spectrometry, and then software of this type is used to
identify the peptides (2, 3) and, by inference, determine
which proteins have been detected in the mixture (4).¹
Although it is currently much less common, this type of
software can also be applied to the direct analysis of
endogenous peptides that result from the natural
proteolysis in an organism (5–10). The Paragon Algorithm
and this study specifically focus on the peptide
identification process. This search engine is part of a
larger package called ProteinPilot™ Software, which uses
the peptide identification approach described here and then
automatically conducts protein inference analysis with the
Pro Group™ Algorithm discussed elsewhere.¹⁻³
Protein identification for the analysis of MS/MS
fragmentation data in the bottom-up approach can be thought
of as having four main stages: 1) preprocessing, 2)
selection of peptide hypotheses, 3) scoring peptide
hypotheses, and 4) protein inference. Stage 1,
preprocessing, can include conversion of raw data to
simplified peak lists, averaging of spectra deemed
sufficiently similar, filtering of spectra considered
unlikely to yield a good identification, etc. Most tools
fall into one of two main categories differing in how
hypotheses are selected: sequence approaches use some de novo
From Applied Biosystems/MDS Sciex, Foster City, California 94404
Received, September 12, 2006, and in revised form, May 26, 2007
Published, MCP Papers in Press, May 27, 2007, DOI 10.1074/mcp.T600050-MCP200
¹ Seymour, S. L., Loboda, A., Tang, W. H., Nimkar, S., and Schaeffer,
D. A. (2004) Poster presented at the 52nd ASMS Conference on
Mass Spectrometry and Allied Topics, Nashville, TN (May 23–27,
2004).
² Seymour, S. L. (2005) PowerPoint presentation at the MCP Workshop:
Criteria for Publication of Proteomic Data, Paris, France (May
12–13, 2005) (www.mcponline.org/misc/PariReport_PP.shtml).
³ S. L. Seymour, A. Loboda, W. H. Tang, A. A. Patel, I. V. Shilov, and
D. A. Schaeffer, manuscript in preparation.
Technology
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc. 1638 Molecular & Cellular Proteomics 6.9
This paper is available on line at http://www.mcponline.org