Peptide retention prediction applied to proteomic data analysis

Martin Gilar 1*, Aleksander Jaworski 2, Petra Olivova 1 and John C. Gebler 1

1 Waters Corporation, 34 Maple Street, Milford, MA 01757, USA
2 51 Palomino Drive, Franklin, MA 02038, USA

Received 12 March 2007; Revised 13 June 2007; Accepted 24 June 2007

A retention prediction model was developed for peptides separated in reversed-phase chromatography. The model was utilized to identify and exclude the false positive (FP) peptide identifications obtained via database search. The selected database included human proteins, as well as decoy sequences of random proteins. The FP peptide detection rate was defined either as the number of retention time outliers or as the number of random decoy sequence identifications. The FP rate for various MASCOT scores was calculated. The peptides identified in one-dimensional (1D) and two-dimensional (2D) liquid chromatography/mass spectrometry (LC/MS) experiments were validated by prediction models. Multi-dimensional LC was based on two orthogonal reversed-phase chromatography modes; prediction models were successfully applied for data filtering in both separation dimensions. Copyright © 2007 John Wiley & Sons, Ltd.

In spite of the advances in analytical instrumentation and bioinformatics software development, proteomic analysis using liquid chromatography/mass spectrometry (LC/MS) remains a daunting task. The primary reasons are: (i) Extreme sample complexity, resulting in component overlap on the LC 1 as well as the MS level. 2 This complicates data deconvolution and protein identification via the database search. (ii) The dynamic range of sample concentrations typically exceeds the linear range of MS instruments. Only the proteins/peptides present at detectable levels can be reliably identified. 3–5 (iii) Some ambiguity is associated with peptide identifications. Database search algorithms and search criteria require further refinements.
6,7 Users have to manually define search criteria that yield an acceptable protein identification error rate.

One of the most challenging proteomic samples is serum or plasma. While the agreement between laboratories is good for high-abundance proteins, the medium- and low-abundance species are difficult to identify with confidence. These proteins are typically detected by a single peptide or a few peptides of low intensity; therefore, the variability between published reports is high. In 2005 the Human Proteome Organization (HUPO) distributed a uniform sample and compiled analysis results from 35 participating laboratories. 8 A subset of 3020 proteins was compiled, representing the detected plasma/serum proteins. The protein lists generated by participating laboratories generally overlapped with this database in only 15–20% of cases. Some variability was likely introduced by the method 9 (gels vs. 1D LC vs. 2D LC) used for analysis, some by the different bioinformatics software used. 6 Nevertheless, the lack of reproducibility and repeatability in this study suggests that many reported proteins were false positive (FP) hits. The FP identification rate can be reduced by raising the search score cut-off level(s). Understandably, this reduces the number of correctly identified proteins as well. 10 What level of FP hits is acceptable at the protein and peptide level, and how to reduce it, is still under discussion in the scientific community. 11–13

Several approaches have been developed to estimate the error rate in proteomic experiments. The FP rate was defined as the percentage of proteins identified via a search against a decoy database (comprising either reversed or randomized peptide sequences). 14–17 A principal goal of decoy databases is to define the search criteria that will maximize the number of identified species, while maintaining an acceptable error rate (1–5%).
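The decoy-based error estimate described above can be illustrated with a short sketch. It assumes a hypothetical list of hits, each carrying a search score and a flag that marks matches to decoy sequences. The function names, the toy scores, and the simple ratio of decoy hits to all accepted hits are illustrative assumptions, not the procedure used in this study.

```python
from typing import Iterable, List, Optional, Tuple

# (search score, True if the match came from a decoy sequence); hypothetical layout
Hit = Tuple[float, bool]


def fp_rate_percent(hits: List[Hit], cutoff: float) -> Optional[float]:
    """Percentage of hits scoring >= cutoff that are decoy matches.

    Returns None when no hit passes the cutoff, since the rate is undefined.
    """
    accepted = [is_decoy for score, is_decoy in hits if score >= cutoff]
    if not accepted:
        return None
    return 100.0 * sum(accepted) / len(accepted)


def lowest_cutoff_within(
    hits: List[Hit], cutoffs: Iterable[float], max_fp_percent: float
) -> Optional[float]:
    """Lowest candidate cutoff whose decoy-based FP rate stays within the limit.

    A lower cutoff accepts more hits, so the lowest passing cutoff keeps the
    largest number of identifications at the chosen error level.
    """
    for cutoff in sorted(cutoffs):
        rate = fp_rate_percent(hits, cutoff)
        if rate is not None and rate <= max_fp_percent:
            return cutoff
    return None


# Toy data, invented for illustration only.
hits: List[Hit] = [
    (55, False), (48, False), (42, False), (40, False), (37, True),
    (35, False), (33, False), (31, False), (28, True), (25, False),
    (22, True), (20, False), (18, True), (15, True),
]

for cutoff in (10, 20, 30, 40):
    print(cutoff, fp_rate_percent(hits, cutoff))

print("chosen cutoff:", lowest_cutoff_within(hits, [10, 20, 30, 40, 50], 5.0))
```

With a real data set the candidate cutoffs would be scanned over the full score range. A larger number of hits is needed before a 1–5% limit can be resolved meaningfully, because a handful of accepted hits gives only a coarse estimate of the rate.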
Several laboratories have implemented additional data filtering based on physicochemical properties of peptides deduced from their sequence. For example, the correlation between calculated peptide pI and the fraction number (position in the gel strip) in isoelectric focusing (IEF) was utilized to remove FP identifications from large proteomic data sets. 16,18–20 It has been known for decades that peptide retention in reversed-phase (RP)-LC can be predicted from the peptide sequence. 21–26 The early reports inspired the development of more robust retention prediction models and their application to proteomic data analysis. 27–31

RAPID COMMUNICATIONS IN MASS SPECTROMETRY
Rapid Commun. Mass Spectrom. 2007; 21: 2813–2821
Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/rcm.3150
*Correspondence to: M. Gilar, Waters Corporation, 34 Maple St., Milford, MA 01757, USA. E-mail: Martin_Gilar@waters.com

In this study we use the developed peptide retention prediction model to evaluate the FP rate of peptide identification in a large data set. The FP estimate based on