Peptide retention prediction applied to proteomic data analysis

Martin Gilar 1*, Aleksander Jaworski 2, Petra Olivova 1 and John C. Gebler 1

1 Waters Corporation, 34 Maple Street, Milford, MA 01757, USA
2 51 Palomino Drive, Franklin, MA 02038, USA

Received 12 March 2007; Revised 13 June 2007; Accepted 24 June 2007

A retention prediction model was developed for peptides separated in reversed-phase chromatography. The model was utilized to identify and exclude the false positive (FP) peptide identifications obtained via database search. The selected database included human proteins, as well as decoy sequences of random proteins. The FP peptide detection rate was defined either as the number of retention time outliers or as the number of random decoy sequence identifications. The FP rate for various MASCOT scores was calculated. The peptides identified in one-dimensional (1D) and two-dimensional (2D) liquid chromatography/mass spectrometry (LC/MS) experiments were validated by prediction models. Multi-dimensional LC was based on two orthogonal reversed-phase chromatography modes; prediction models were successfully applied for data filtering in both separation dimensions. Copyright © 2007 John Wiley & Sons, Ltd.

In spite of the advances in analytical instrumentation and bioinformatics software development, proteomic analysis using liquid chromatography/mass spectrometry (LC/MS) remains a daunting task. The primary reasons are: (i) Extreme sample complexity, resulting in component overlap on the LC 1 as well as the MS level. 2 This complicates data deconvolution and protein identification via the database search. (ii) The dynamic range of sample concentrations typically exceeds the linear range of MS instruments. Only the proteins/peptides present at detectable levels can be reliably identified. 3–5 (iii) Some ambiguity is associated with peptide identifications. Database search algorithms and search criteria require further refinements.
6,7 Users have to manually define the search criteria yielding an acceptable protein identification error rate.

One of the most challenging proteomic samples is serum or plasma. While the agreement between laboratories is good for high-abundance proteins, the medium- and low-abundance species are difficult to identify with confidence. These proteins are typically detected by a single peptide or a few peptides of low intensity; therefore, the variability between published reports is high. In 2005 the Human Proteome Organization (HUPO) distributed a uniform sample and compiled analysis results from 35 participating laboratories. 8 A subset of 3020 proteins was compiled, representing the detected plasma/serum proteins. The protein lists generated by participating laboratories generally overlapped with this database in only 15–20% of cases. Some variability was likely introduced by the method 9 (gels vs. 1D LC vs. 2D LC) used for analysis, some by various bioinformatics software. 6 Nevertheless, the lack of reproducibility and repeatability in this study suggests that many reported proteins were false positive (FP) hits.

The FP identification rate can be reduced by elevating the search score(s) cut-off level. Understandably, this reduces the number of correctly identified proteins as well. 10 The discussion of what is the acceptable level of FP hits on the protein and peptide level, and how to reduce it, is currently ongoing in the scientific community. 11–13

Several approaches have been developed to estimate the error rate in proteomic experiments. The FP rate was defined as the percentage of proteins identified via a search against a decoy database (comprising either reversed or randomized peptide sequences). 14–17 A principal goal of decoy databases is to define the search criteria that will maximize the number of identified species, while maintaining an acceptable error rate (1–5%).
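The trade-off between score cut-off and FP rate described above can be illustrated with a minimal sketch. It is not the procedure used in this study: the `PeptideHit` record, the function names, and the definition of the FP rate as the fraction of accepted hits that match decoy sequences are assumptions made for illustration only, and other decoy-based estimates are in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one peptide identification."""
    sequence: str
    score: float     # search-engine score, e.g. a MASCOT score
    is_decoy: bool   # True if the match is to a decoy (random/reversed) sequence


def fp_rate(hits, cutoff):
    """Return (number of accepted non-decoy hits, estimated FP rate).

    Assumption: FP rate = decoy hits / all hits with score >= cutoff.
    """
    accepted = [h for h in hits if h.score >= cutoff]
    if not accepted:
        return 0, 0.0
    decoys = sum(1 for h in accepted if h.is_decoy)
    return len(accepted) - decoys, decoys / len(accepted)


def best_cutoff(hits, max_fp=0.05):
    """Pick the observed score that keeps the most non-decoy hits
    while the estimated FP rate stays at or below max_fp.

    Returns (cutoff, accepted non-decoy hits, FP rate), or None if
    no cut-off satisfies the limit.
    """
    best = None
    for cutoff in sorted({h.score for h in hits}):
        targets, rate = fp_rate(hits, cutoff)
        if rate <= max_fp and (best is None or targets > best[1]):
            best = (cutoff, targets, rate)
    return best
```

Calling `best_cutoff(hits, max_fp=0.05)` on a list of scored hits returns the most permissive cut-off that satisfies the chosen error limit. Raising `max_fp` admits more identifications at the cost of more decoy matches, which is the compromise the 1–5% range reflects.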
Several laboratories have implemented additional data filtering based on physicochemical properties of peptides deduced from their sequence. For example, the correlation between calculated peptide pI and the fraction number (position in the gel strip) in isoelectric focusing (IEF) was utilized to remove FP identifications from large proteomic data sets. 16,18–20 It has been known for decades that peptide retention in reversed-phase (RP)-LC can be predicted from the peptide sequence. 21–26 The early reports inspired the development of more robust retention prediction models and their application to proteomic data analysis. 27–31 In this study we use the developed peptide retention prediction model to evaluate the FP rate of peptide identification in a large data set.

RAPID COMMUNICATIONS IN MASS SPECTROMETRY
Rapid Commun. Mass Spectrom. 2007; 21: 2813–2821
Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/rcm.3150

*Correspondence to: M. Gilar, Waters Corporation, 34 Maple St., Milford, MA 01757, USA. E-mail: Martin_Gilar@waters.com

The FP estimate based on