Mass spectrometrists should search for all peptides, but assess only the ones they care about

Adriaan Sticker (1, 2, 3, 4), Lennart Martens (2, 3, 4, *), and Lieven Clement (1, 4, *)

(1) Department of Applied Mathematics, Computer Science & Statistics, Ghent University, Belgium
(2) Medical Biotechnology Center, VIB, Ghent, Belgium
(3) Department of Biochemistry, Ghent University, Ghent, Belgium
(4) Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
(*) These authors contributed equally to this work and are joint corresponding authors, email: lennart.martens@vib-ugent.be & lieven.clement@ugent.be

August 12, 2016

Abstract

In shotgun proteomics, identified mass spectra that are deemed irrelevant to the scientific hypothesis are often discarded. Noble (2015) [1] therefore urged researchers to remove irrelevant peptides from the database prior to searching to improve statistical power. Here, however, we argue that both the classical method and Noble’s revised method produce suboptimal peptide identifications and have problems controlling the false discovery rate (FDR). Instead, we show that searching for all expected peptides and removing irrelevant peptides prior to FDR calculation results in more reliable identifications at a controlled FDR level than the classical strategy, which discards irrelevant peptides after FDR calculation, or Noble’s strategy, which discards irrelevant peptides prior to searching.

1 Introduction

Reliable peptide identification is key to every mass spectrometry-based shotgun proteomics workflow. Growing concern about reproducibility has led leading journals to require that all peptide-to-spectrum matches (PSMs) be reported along with an estimate of their statistical confidence. The false discovery rate (FDR), i.e. the expected fraction of incorrect identifications, is a very popular statistic for this purpose.
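As a brief aside, the verbal definition of the FDR given above can be written in symbols. The notation is an assumption introduced here for illustration only and does not appear in the original text: R denotes the number of reported PSMs and V the number of those that are incorrect.

```latex
\mathrm{FDR} = \mathbb{E}\!\left[ \frac{V}{\max(R,\, 1)} \right]
```

The \max(R, 1) term is a common convention that sets the fraction to zero when no PSM is reported.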
In many experiments, however, researchers want to focus on proteins of particular pathways, or on a few organisms in a metaproteomics sample. Hence, a large fraction of the identified peptides is deemed irrelevant to their scientific hypothesis. Considering all PSMs induces an overwhelming multiple testing problem, which leads to few identifications of relevant peptides and thus to underpowered studies [2]. Currently, there is much debate on the optimal search strategy to boost statistical power in this context (e.g. Noble, 2015 [1] and http://www.matrixscience.com/nl/201603/newsletter.html).

bioRxiv preprint doi: https://doi.org/10.1101/094581; this version posted February 21, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY 4.0 International license.

The common approach, here referred to as the search-all-assess-all (all-all) strategy, involves (1) searching against all expected peptides in a sample, (2) calculating an FDR for each PSM, and (3) filtering irrelevant peptides from the candidate list. Recently, Noble (2015) [1] pointed out that this strategy is suboptimal because many unnecessary hypotheses are evaluated. Therefore, he