An Adaptive Pipeline To Maximize Isobaric Tagging Data in Large-Scale MS-Based Proteomics

John Corthésy,†,⊥ Konstantinos Theofilatos,‡,⊥ Seferina Mavroudi,‡,§ Charlotte Macron,† Ornella Cominetti,† Mona Remlawi,† Francesco Ferraro,† Antonio Núñez Galindo,† Martin Kussmann,†,# Spiridon Likothanassis,*,‡,∥ and Loïc Dayon*,†

† Nestlé Institute of Health Sciences, Lausanne 1015, Switzerland
‡ InSybio, Ltd., Innovations House, 19 Staple Gardens, Winchester SO23 8SR, United Kingdom
§ Department of Social Work, School of Sciences of Health and Care, Technological Educational Institute of Western Greece, Patras 26334, Greece
∥ Department of Computer Engineering and Informatics, University of Patras, Patras 26500, Greece

Supporting Information

ABSTRACT: Isobaric tagging is the method of choice in mass-spectrometry-based proteomics for comparing several conditions at a time. Despite its multiplexing capabilities, some drawbacks appear when multiple experiments are merged for comparison in large sample-size studies due to the presence of missing values, which result from the stochastic nature of the data-dependent acquisition mode. Another indirect cause of data incompleteness might derive from the proteomic-typical data-processing workflow that first identifies proteins in individual experiments and then only quantifies those identified proteins, leaving a large number of unmatched spectra with quantitative information unexploited. Inspired by untargeted metabolomic and label-free proteomic workflows, we developed a quantification-driven bioinformatic pipeline (Quantify then Identify (QtI)) that optimizes the processing of isobaric tandem mass tag (TMT) data from large-scale studies. This pipeline includes innovative features, such as peak filtering with a self-adaptive preprocessing pipeline optimization method, Peptide Match Rescue, and Optimized Post-Translational Modification.
QtI outperforms a classical benchmark workflow in terms of quantification and identification rates, significantly reducing missing data while preserving unmatched features for quantitative comparison. The number of unexploited tandem mass spectra was reduced by 77% and 62% for two human cerebrospinal fluid and plasma data sets, respectively.

KEYWORDS: algorithms, bioinformatics, biomarkers, discovery, isobaric tagging, machine learning, protein identification, quantification, tandem mass spectrometry, tandem mass tag

■ INTRODUCTION

Mass-spectrometry (MS)-based shotgun proteomics can generate large data sets, composed of millions of tandem mass spectra, that are usually first matched by comparison with theoretical fragmentation spectra of defined proteolytic peptide sequences.¹ In many workflows, protein identification based on spectral matching is followed by quantification of the identified proteins under different biological conditions.² However, this broadly used data-processing pipeline, referred to in the following as the "classical workflow", proves rather inefficient considering the large amount of unexploited spectral information. For instance, only a fraction of all acquired spectra matched in independent experiments is further used to provide complete quantitative information. The precedence given to the protein identification step can be attributed predominantly to historical reasons, because the initial role of MS was to identify proteins after gel electrophoresis.³ Protein identification may have inappropriately remained the first processing task performed, to the detriment of quantification in many workflows, especially those employing isobaric labeling.
The yield of spectral matching in MS-based proteomics is rather limited, and several reports have indicated typical success rates between 20 and 50%.⁴,⁵ The causes include, for instance, sequence isoforms and variants not present in databases, the presence of unexpected and multiple modifications, or low tandem spectral information (e.g., low spectral representation or low peak intensity, depending on the nature of the peptides but also on the timing of their fragmentation during the chromatographic elution). These figures also apply when isobaric tagging technologies (e.g., isobaric tags for relative and absolute quantitation (iTRAQ)⁶ and tandem mass tag (TMT)⁷) are used for relative protein quantification. Importantly, many unmatched tandem mass spectra from such labeling experiments harbor information on true peptides possibly assignable

Received: February 15, 2018
Published: April 26, 2018
Cite This: J. Proteome Res. 2018, 17, 2165−2173
DOI: 10.1021/acs.jproteome.8b00110
© 2018 American Chemical Society