828 VOLUME 34 NUMBER 8 AUGUST 2016 NATURE BIOTECHNOLOGY
PERSPECTIVE
The potential of the diverse chemistries present in natural
products (NP) for biotechnology and medicine remains
untapped because NP databases are not searchable with raw
data and the NP community has no way to share data other
than in published papers. Although mass spectrometry (MS)
techniques are well-suited to high-throughput characterization
of NP, there is a pressing need for an infrastructure to enable
sharing and curation of data. We present Global Natural
Products Social Molecular Networking (GNPS; http://gnps.ucsd.edu),
an open-access knowledge base for community-wide
organization and sharing of raw, processed or identified tandem
mass (MS/MS) spectrometry data. In GNPS, crowdsourced
curation of freely available community-wide reference MS
libraries will underpin improved annotations. Data-driven
social-networking should facilitate identification of spectra and
foster collaborations. We also introduce the concept of ‘living
data’ through continuous reanalysis of deposited data.
NP from marine and terrestrial environments, including their inhabiting
microorganisms, plants, animals, and humans, are routinely analyzed
using MS. However, a single MS experiment can collect thousands
of MS/MS spectra in minutes [1], and individual projects can acquire
millions of spectra. These data sets are too large for manual analysis.
Furthermore, comprehensive software and proper computational infrastructure
are not readily available, and only low-throughput sharing
of either raw or annotated spectra is feasible, even among members
of the same laboratory. The potentially useful information in MS/MS
data sets can thus remain buried in papers, laboratory notebooks, and
private databases, hindering retrieval, mining, and sharing of data and
knowledge. Although several NP databases (Dictionary of Natural
Products [2], AntiBase [3], and MarinLit [4]) assist in dereplication
(identification of known compounds), these resources are not freely
available and do not process MS data. Conversely, MS databases, including
MassBank [5], Metlin [6], mzCloud [7], and ReSpect [8], host MS/MS spectra but
limit data analyses to several individual spectra or a limited number of
liquid chromatography (LC)–MS files. Other free online computation
resources that leverage the MS/MS spectra of Metlin, such as those
provided by mzCloud and XCMS Online, are available. However,
neither of those allows free download of its reference library.
Global genomics and proteomics research has been facilitated
by the development of integral resources, such as the US National
Center for Biotechnology Information (NCBI; Bethesda, MD, USA)
and UniProt KnowledgeBase (UniProtKB), which provide robust
platforms for data sharing and knowledge dissemination [9,10].
Recognizing the need for an analogous community platform to analyze
NP MS data, we present GNPS. GNPS is a data-driven platform
for the storage, analysis, and knowledge dissemination of MS/MS
spectra that enables community sharing of raw spectra, continuous
annotation of deposited data, and collaborative curation of reference
spectra (referred to as spectral libraries) and experimental data
(organized as data sets).
GNPS provides the ability to analyze a data set and to compare it
to all publicly available data. By building on the computational infrastructure
of the University of California San Diego (UCSD) Center
for Computational Mass Spectrometry (CCMS; http://proteomics.ucsd.edu/),
GNPS provides public data set deposition and/or retrieval
through the Mass Spectrometry Interactive Virtual Environment
(MassIVE) data repository. The GNPS analysis infrastructure further
enables online dereplication [6,11–13], automated molecular networking
analysis [14–21], and crowdsourced MS/MS spectrum curation. Each
data set added to the GNPS repository is automatically reanalyzed
in the next monthly cycle of continuous identification (see ‘Living
data by continuous analysis’ below). Each of the tens of millions of
spectra in GNPS data sets is matched to reference spectral libraries to
annotate molecules and to discover putative analogs (Fig. 1a). From
January 2014 to November 2015, GNPS grew to serve 9,267 users
from 100 countries (Fig. 1b), with 42,486 analysis sessions that have
processed >93 million spectra as molecular networks from a quarter-million
LC–MS runs. Searches against a combined catalog of over
221,000 MS/MS reference library spectra from 18,163 compounds
(Supplementary Table 1) are possible, and GNPS has matched almost
one hundred million MS/MS spectra in all public and private search
jobs using an estimated 84,000 compute hours.
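
The matching of deposited spectra to reference libraries described above can be illustrated with a minimal sketch. It is not the GNPS implementation. The `Spectrum` type, the greedy cosine score, the tolerances, the 0.7 score threshold, and the `analog_window` parameter are all assumptions chosen for illustration; this section does not specify the scoring function or parameters that GNPS uses. A query is reported as a "match" when its precursor m/z agrees with the reference, and as a putative "analog" when the precursor differs but fragment peaks align after allowing a shift by the precursor mass difference.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Spectrum:
    name: str
    precursor_mz: float
    peaks: list[tuple[float, float]]  # (fragment m/z, intensity)


def cosine_score(a, b, tol=0.5, shift=0.0):
    """Greedy cosine similarity between two spectra.

    Peaks pair up when their m/z values agree within `tol`, either
    directly or after removing `shift` (a.precursor_mz - b.precursor_mz),
    which lets fragments carrying a modification still align.
    """
    norm_a = sqrt(sum(i * i for _, i in a.peaks))
    norm_b = sqrt(sum(i * i for _, i in b.peaks))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # All candidate peak pairs, largest intensity product first.
    pairs = sorted(
        ((ia * ib, x, y)
         for x, (ma, ia) in enumerate(a.peaks)
         for y, (mb, ib) in enumerate(b.peaks)
         if abs(ma - mb) <= tol or abs(ma - mb - shift) <= tol),
        reverse=True,
    )
    used_a, used_b, dot = set(), set(), 0.0
    for product, x, y in pairs:
        # Each peak may contribute to at most one pair.
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            dot += product
    return dot / (norm_a * norm_b)


def search_library(query, library, precursor_tol=2.0,
                   min_score=0.7, analog_window=0.0):
    """Return (name, kind, score) hits sorted by descending score.

    All thresholds are illustrative assumptions, not GNPS defaults.
    """
    hits = []
    for ref in library:
        delta = query.precursor_mz - ref.precursor_mz
        if abs(delta) <= precursor_tol:
            kind, score = "match", cosine_score(query, ref)
        elif abs(delta) <= analog_window:
            kind, score = "analog", cosine_score(query, ref, shift=delta)
        else:
            continue
        if score >= min_score:
            hits.append((ref.name, kind, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)
```

With `analog_window=0.0` (the default in this sketch) only same-precursor references are considered, which corresponds to plain dereplication. Setting a non-zero window additionally surfaces candidates whose precursor mass differs, in the spirit of the putative analog discovery mentioned above.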
GNPS spectral libraries
GNPS spectral libraries enable dereplication, variable dereplication
(approximate matches to spectra of related molecules), and identification
of spectra in molecular networks. GNPS has collected available
MS/MS spectral libraries relevant to NP (which also include other
metabolites and molecules), including MassBank [5], ReSpect [8], and
NIST [22] (Table 1, Fig. 2a and Supplementary Table 1). Altogether, these
third-party libraries total 212,230 MS/MS spectra representing 12,694
unique compounds (Fig. 2b). Although this combined collection
of reference spectra provides a starting point for dereplication, only
Sharing and community curation of mass
spectrometry data with Global Natural Products
Social Molecular Networking
A full list of authors and affiliations appears at the end of the paper.
Received 11 August 2015; accepted 10 May 2016; published online 9 August 2016; doi:10.1038/nbt.3597
© 2016 Nature America, Inc. All rights reserved.