An Adaptive Pipeline To Maximize Isobaric Tagging Data in Large-Scale MS-Based Proteomics
John Corthésy,†,⊥ Konstantinos Theofilatos,‡,⊥ Seferina Mavroudi,‡,§ Charlotte Macron,† Ornella Cominetti,† Mona Remlawi,† Francesco Ferraro,† Antonio Núñez Galindo,† Martin Kussmann,†,# Spiridon Likothanassis,*,‡,∥ and Loïc Dayon*,†
† Nestlé Institute of Health Sciences, Lausanne 1015, Switzerland
‡ InSybio, Ltd., Innovations House, 19 Staple Gardens, Winchester SO238SR, United Kingdom
§ Department of Social Work, School of Sciences of Health and Care, Technological Educational Institute of Western Greece, Patras 26334, Greece
∥ Department of Computer Engineering and Informatics, University of Patras, Patras 26500, Greece
* S Supporting Information
ABSTRACT: Isobaric tagging is the method of choice in mass-spectrometry-based
proteomics for comparing several conditions at a time. Despite its multiplexing
capabilities, some drawbacks appear when multiple experiments are merged for
comparison in large sample-size studies due to the presence of missing values, which
result from the stochastic nature of the data-dependent acquisition mode. Another
indirect cause of data incompleteness might derive from the proteomic-typical data-
processing workflow that first identifies proteins in individual experiments and then only
quantifies those identified proteins, leaving a large number of unmatched spectra with
quantitative information unexploited. Inspired by untargeted metabolomic and label-free
proteomic workflows, we developed a quantification-driven bioinformatic pipeline
(Quantify then Identify (QtI)) that optimizes the processing of isobaric tandem mass
tag (TMT) data from large-scale studies. This pipeline includes innovative features, such
as peak filtering with a self-adaptive preprocessing pipeline optimization method,
Peptide Match Rescue, and Optimized Post-Translational Modification. QtI outperforms a classical benchmark workflow in
terms of quantification and identification rates, significantly reducing missing data while preserving unmatched features for
quantitative comparison. The number of unexploited tandem mass spectra was reduced by 77 and 62% for two human
cerebrospinal fluid and plasma data sets, respectively.
KEYWORDS: algorithms, bioinformatics, biomarkers, discovery, isobaric tagging, machine learning, protein identification,
quantification, tandem mass spectrometry, tandem mass tag
■ INTRODUCTION
Mass-spectrometry (MS)-based shotgun proteomics can
generate large data sets, composed of millions of tandem
mass spectra, that are usually first matched by comparison with
theoretical fragmentation spectra of defined proteolytic peptide
sequences.¹ In many workflows, protein identification based on
spectral matching is followed by quantification of the identified
proteins under different biological conditions.² However, this
broadly used data processing pipeline, hereafter called the
“classical workflow”, can prove rather inefficient considering the
large amount of unexploited spectral information. For instance,
only a fraction of all acquired spectra matched in independent
experiments is further used to provide complete quantitative
information. The prevalence of the protein identification step can
be attributed mainly to historical reasons, namely the initial role
of MS in identifying proteins after gel electrophoresis.³ Protein
identification may have inappropriately remained the first
processing task performed, to the detriment of quantification in
many workflows, especially those employing isobaric labeling.
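The difference between the two orders of operation can be illustrated with a toy Python sketch. This is only a minimal illustration under stated assumptions: the `Spectrum` structure, the functions `classical_workflow` and `quantify_first`, and all intensity values are hypothetical and are not part of the pipeline or data described in this work. The sketch shows only that quantifying before identifying retains reporter-ion information from spectra that failed database matching.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Spectrum:
    """Toy tandem mass spectrum (hypothetical structure, illustration only)."""
    scan_id: int
    peptide: Optional[str]                     # None when spectral matching failed
    reporter_intensities: Tuple[float, ...]    # one value per isobaric channel


def classical_workflow(spectra):
    """Identify first, then quantify only the identified spectra."""
    identified = [s for s in spectra if s.peptide is not None]
    return {s.scan_id: s.reporter_intensities for s in identified}


def quantify_first(spectra):
    """Quantify every spectrum with reporter signal; identification is an optional annotation."""
    return {
        s.scan_id: (s.peptide, s.reporter_intensities)
        for s in spectra
        if any(i > 0 for i in s.reporter_intensities)
    }


# Made-up example: three spectra, one of which has no peptide match.
spectra = [
    Spectrum(1, "PEPTIDEK", (100.0, 120.0)),
    Spectrum(2, None, (80.0, 95.0)),      # unmatched, yet quantifiable
    Spectrum(3, "SAMPLER", (60.0, 0.0)),
]

classical = classical_workflow(spectra)
quant_first = quantify_first(spectra)

# Scans with quantitative information that the classical order discards.
unexploited = sorted(set(quant_first) - set(classical))
print(unexploited)
```

In this toy case, scan 2 carries reporter-ion intensities but is dropped by the identification-first order, whereas the quantification-first order keeps it as an unannotated feature.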
The yield of spectral matching in MS-based proteomics is
rather limited, and several reports have indicated typical success
rates between 20 and 50%,⁴,⁵ due to, for instance, sequence
isoforms and variants not present in databases, the presence of
unexpected and multiple modifications, or low tandem spectral
information (e.g., low spectral representation or low peak
intensity depending on the nature of the peptides but also the
timing of their fragmentation during the chromatographic
elution). These figures also apply when isobaric tagging
technologies (e.g., isobaric tags for relative and absolute
quantitation (iTRAQ)⁶ and tandem mass tag (TMT)⁷) are
used for relative protein quantification. Importantly, many
unmatched tandem mass spectra from such labeling experiments
harbor information on true peptides possibly assignable
Received: February 15, 2018
Published: April 26, 2018
pubs.acs.org/jpr
Cite This: J. Proteome Res. 2018, 17, 2165-2173
© 2018 American Chemical Society. DOI: 10.1021/acs.jproteome.8b00110