Mass spectrometrists should search for all peptides, but assess only the ones they care about

Adriaan Sticker 1,2,3,4, Lennart Martens 2,3,4,*, and Lieven Clement 1,4,*

1 Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Belgium
2 Medical Biotechnology Center, VIB, Ghent, Belgium
3 Department of Biochemistry, Ghent University, Ghent, Belgium
4 Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
* These authors contributed equally to this work and are joint corresponding authors; email: lennart.martens@vib-ugent.be & lieven.clement@ugent.be

August 12, 2016

Abstract

In shotgun proteomics, identified mass spectra that are deemed irrelevant to the scientific hypothesis are often discarded. Noble (2015)¹ therefore urged researchers to remove irrelevant peptides from the database prior to searching to improve statistical power. Here, however, we argue that both the classical method and Noble's revised method produce suboptimal peptide identifications and have problems controlling the false discovery rate (FDR). Instead, we show that searching for all expected peptides and removing irrelevant peptides prior to FDR calculation results in more reliable identifications at a controlled FDR level than either the classical strategy, which discards irrelevant peptides after FDR calculation, or Noble's strategy, which discards irrelevant peptides prior to searching.

1 Introduction

Reliable peptide identification is key to every mass spectrometry-based shotgun proteomics workflow. Growing concern about reproducibility has prompted leading journals to require that all peptide-to-spectrum matches (PSMs) be reported along with an estimate of their statistical confidence. The false discovery rate (FDR), i.e. the expected fraction of incorrect identifications, is a very popular statistic for this purpose.
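As a sketch of this definition in symbols, assume that $R$ PSMs are reported and that $V$ of them are incorrect; the symbols $R$ and $V$ are notation assumed here for illustration, not taken from the text:

```latex
\mathrm{FDR} = \mathrm{E}\!\left[ \frac{V}{\max(R, 1)} \right]
```

The $\max(R, 1)$ in the denominator defines the fraction as zero when no PSM is reported.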
In many experiments, however, researchers want to focus on proteins of particular pathways, or on a few organisms in a metaproteomics sample. Hence, a large fraction of the identified peptides are deemed irrelevant to their scientific hypothesis. Considering all PSMs induces an overwhelming multiple testing problem, which leads to few identifications of relevant peptides and thus to underpowered studies.² Currently, there is much debate on the optimal search strategy to boost statistical power in this context (e.g. Noble, 2015¹ and http://www.matrixscience.com/nl/201603/newsletter.html).

bioRxiv preprint doi: https://doi.org/10.1101/094581; this version posted February 21, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license.

The common approach, here referred to as the search-all-assess-all (all-all) strategy, involves (1) searching against all expected peptides in a sample, (2) calculating an FDR for each PSM, and (3) filtering irrelevant peptides from the candidate list. Recently, Noble (2015)¹ pointed out that this strategy is suboptimal because many unnecessary hypotheses are evaluated. Therefore, he