The Paragon Algorithm, a Next Generation Search Engine That Uses Sequence Temperature Values and Feature Probabilities to Identify Peptides from Tandem Mass Spectra*

Ignat V. Shilov‡, Sean L. Seymour‡§, Alpesh A. Patel, Alex Loboda, Wilfred H. Tang, Sean P. Keating¶, Christie L. Hunter, Lydia M. Nuwaysir, and Daniel A. Schaeffer

The Paragon™ Algorithm, a novel database search engine for the identification of peptides from tandem mass spectrometry data, is presented. Sequence Temperature Values are computed using a sequence tag algorithm, allowing the degree of implication by an MS/MS spectrum of each region of a database to be determined on a continuum. Counter to conventional approaches, features such as modifications, substitutions, and cleavage events are modeled with probabilities rather than by discrete user-controlled settings to consider or not consider a feature. The use of feature probabilities in conjunction with Sequence Temperature Values allows for a very large increase in the effective search space with only a very small increase in the actual number of hypotheses that must be scored. The algorithm has a new kind of user interface that removes the user expertise requirement, presenting control settings in the language of the laboratory that are translated to optimal algorithmic settings. To validate this new algorithm, a comparison with Mascot is presented for a series of analogous searches to explore the relative impact of increasing search space probed with Mascot by relaxing the tryptic digestion conformance requirements from trypsin to semitrypsin to no enzyme and with the Paragon Algorithm using its Rapid mode and Thorough mode with and without tryptic specificity. Although they performed similarly for small search space, dramatic differences were observed in large search space.
With the Paragon Algorithm, hundreds of biological and artifact modifications, all possible substitutions, and all levels of conformance to the expected digestion pattern can be searched in a single search step, yet the typical cost in search time is only 2–5 times that of conventional small search space. Despite this large increase in effective search space, there is no drastic loss of discrimination that typically accompanies the exploration of large search space. Molecular & Cellular Proteomics 6:1638–1655, 2007.

This study presents a new software technology for the identification of peptides from tandem mass spectra called the Paragon™ Algorithm, hereafter referred to interchangeably as “Paragon.” The most common application for this class of software tools is so-called “shotgun” or “bottom-up” proteomics experiments (1) where a protein mixture of any complexity is digested with a proteolytic enzyme or reagent, the peptides are analyzed by tandem mass spectrometry, and then software of this type is used to identify the peptides (2, 3) and, by inference, determine which proteins have been detected in the mixture (4).¹ Although it is currently much less common, this type of software can also be applied to the direct analysis of endogenous peptides that result from the natural proteolysis in an organism (5–10). The Paragon Algorithm and this study specifically focus on the peptide identification process. This search engine is part of a larger package called ProteinPilot™ Software, which uses the peptide identification approach described here and then automatically conducts protein inference analysis with the Pro Group™ Algorithm discussed elsewhere.¹⁻³

Protein identification for the analysis of MS/MS fragmentation data in the bottom-up approach can be thought of as having four main stages: 1) preprocessing, 2) selection of peptide hypotheses, 3) scoring peptide hypotheses, and 4) protein inference.
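As a rough illustration of stage 2, and of why relaxing digestion conformance from trypsin to semitrypsin to no enzyme enlarges the search space, the following sketch enumerates candidate peptides from a single protein sequence under each setting. It is not the Paragon Algorithm and not Mascot's implementation. The function names, the length window used in place of a precursor mass window, the unlimited missed cleavages, and the example sequence are all assumptions made for illustration. The cleavage rule is the common textbook one for trypsin: cut after K or R unless the next residue is P.

```python
from typing import List, Set

# Number of peptide termini that must coincide with an expected cleavage site.
# These setting names are hypothetical labels for this sketch only.
REQUIRED_TRYPTIC_ENDS = {"tryptic": 2, "semitryptic": 1, "none": 0}

# Arbitrary example sequence, chosen only for illustration.
EXAMPLE_PROTEIN = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFK"


def cleavage_sites(protein: str) -> List[int]:
    """Return cut positions: both protein termini, plus every position
    that follows K or R and is not followed by P."""
    sites = [0]
    for i in range(1, len(protein)):
        if protein[i - 1] in "KR" and protein[i] != "P":
            sites.append(i)
    sites.append(len(protein))
    return sites


def candidate_peptides(protein: str, specificity: str,
                       min_len: int = 6, max_len: int = 30) -> Set[str]:
    """Enumerate distinct substrings within the length window whose termini
    meet the requested digestion conformance (missed cleavages unlimited)."""
    sites = set(cleavage_sites(protein))
    required = REQUIRED_TRYPTIC_ENDS[specificity]
    found: Set[str] = set()
    for start in range(len(protein)):
        last_end = min(len(protein), start + max_len)
        for end in range(start + min_len, last_end + 1):
            conforming_ends = (start in sites) + (end in sites)
            if conforming_ends >= required:
                found.add(protein[start:end])
    return found


if __name__ == "__main__":
    for setting in ("tryptic", "semitryptic", "none"):
        count = len(candidate_peptides(EXAMPLE_PROTEIN, setting))
        print(f"{setting:12s} {count} candidate peptides")
```

Each relaxation yields a strict superset of the previous candidate set, so a conventional engine that scores every candidate pays for the larger space directly. That cost is what the abstract contrasts with the small increase in scored hypotheses reported for Paragon.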
From Applied Biosystems/MDS Sciex, Foster City, California 94404
Received, September 12, 2006, and in revised form, May 26, 2007
Published, MCP Papers in Press, May 27, 2007, DOI 10.1074/mcp.T600050-MCP200

¹ Seymour, S. L., Loboda, A., Tang, W. H., Nimkar, S., and Schaeffer, D. A. (2004) Poster presented at the 52nd ASMS Conference on Mass Spectrometry and Allied Topics, Nashville, TN (May 23–27, 2004).
² Seymour, S. L. (2005) PowerPoint presentation at the MCP Workshop: Criteria for Publication of Proteomic Data, Paris, France (May 12–13, 2005) (www.mcponline.org/misc/PariReport_PP.shtml).
³ S. L. Seymour, A. Loboda, W. H. Tang, A. A. Patel, I. V. Shilov, and D. A. Schaeffer, manuscript in preparation.

Technology
© 2007 by The American Society for Biochemistry and Molecular Biology, Inc.
Molecular & Cellular Proteomics 6.9
This paper is available online at http://www.mcponline.org

The preprocessing stage (stage 1) can include conversion of raw data to simplified peak lists, averaging of spectra deemed sufficiently similar, filtering of spectra considered unlikely to yield a good identification, etc. Most tools fall into one of two main categories differing in how hypotheses are selected: sequence approaches use some de novo