Future Medicinal Chemistry
Editorial
Does ‘Big Data’ exist in medicinal chemistry, and if so, how can it be harnessed?
Igor V Tetko*,¹,², Ola Engkvist³ & Hongming Chen³
¹Helmholtz Zentrum München-German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstädter Landstraße 1, b. 60w, D-85764 Neuherberg, Germany
²BIGCHEM GmbH, Ingolstädter Landstraße 1, b. 60w, D-85764 Neuherberg, Germany
³Discovery Sciences, AstraZeneca R&D Gothenburg, Pepparedsleden 1, Mölndal, SE-43183, Sweden
*Author for correspondence:
Tel.: +49 89 3187 3575
Fax: +49 89 3187 3585
itetko@vcclab.org
Future Med. Chem. (2016) 8(15), 1801–1806; ISSN 1756-8919; DOI: 10.4155/fmc-2016-0163; © Igor V Tetko, Ola Engkvist & Hongming Chen
First draft submitted: 1 August 2016; Accepted for publication: 12 August 2016;
Published online: 15 September 2016
Keywords: applicability domain • Big Data • chemoinformatics • education in chemistry and informatics • local and global models • multitask learning • neural networks • virtual chemical spaces
The term ‘Big Data’ has gained increasing popularity within the chemistry field and across science broadly in recent years [1]. Chemical databases have seen dramatic growth over the past decade, with, for example, ChEMBL, REAXYS and PubChem providing hundreds of millions of experimental facts for tens of millions of compounds [1]. Moreover, even larger datasets of experimental measurements are held within in-house data collections at pharma companies [2]. Overall, the total number of entries across these databases is in the range of a billion (10⁹); however, although this number may seem impressive, it pales in comparison with other fields [3], where the amount of data is frequently measured in exabytes (10¹⁸ bytes). Thus, does Big Data really exist within the chemistry field? What are such data within medicinal chemistry specifically, and where do the challenges lie in analyzing these data? Big Data refer to data whose scale exceeds that of traditional applications and whose analysis requires efforts beyond traditional approaches [1]. In this article, we discuss how this concept applies to medicinal chemistry and provide an overview of some of the most important trends in the medicinal chemistry–Big Data field.
Does Big Data exist in medicinal chemistry?
A dataset could be classified as ‘big’ if the available technical resources (speed, memory) are not capable of analyzing the data using existing methods. Big Data in a field such as the analysis of particle collisions at CERN [3] is driven by physical challenges (hardware, computer speed and the physical computer memory required to store and analyze such data), which may be addressed by the development of new and more advanced software.
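
The resource-based definition above can be illustrated with a back-of-envelope estimate. The following is a minimal sketch, not a method from this editorial: the compound count, descriptor count and available RAM are assumed values chosen only for illustration, and the helper names are hypothetical.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: float) -> float:
    """Bytes needed to hold a dense n_rows x n_cols matrix in memory."""
    return n_rows * n_cols * bytes_per_value


def fits_in_memory(required_bytes: float, available_bytes: float) -> bool:
    """True if the matrix could be held entirely in the available memory."""
    return required_bytes <= available_bytes


# Assumed, illustrative values (not taken from the editorial).
N_COMPOUNDS = 50_000_000      # hypothetical library size
N_DESCRIPTORS = 1024          # hypothetical descriptors (or fingerprint bits) per compound
RAM_BYTES = 64 * 1024**3      # hypothetical workstation with 64 GiB of RAM

# Same data, two encodings: 1 bit per value versus 8-byte floats.
as_bits = dense_matrix_bytes(N_COMPOUNDS, N_DESCRIPTORS, 1 / 8)
as_float64 = dense_matrix_bytes(N_COMPOUNDS, N_DESCRIPTORS, 8)

print(f"bit-packed: {as_bits / 1e9:.1f} GB, fits: {fits_in_memory(as_bits, RAM_BYTES)}")
print(f"float64:    {as_float64 / 1e9:.1f} GB, fits: {fits_in_memory(as_float64, RAM_BYTES)}")
```

Under these assumptions the same compound collection is either manageable or ‘big’ depending only on how it is represented, which is one way to read the point that the label depends on the available resources and methods rather than on the number of entries alone.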
Medicinal chemistry-related data are created and curated in the pharmaceutical industry via high-throughput screening (HTS) and drug discovery campaigns, and are additionally available in databases sourced from scientific journals, patents, etc. For example, the AstraZeneca in-house screening database contains over 150 million structure–activity relationship (SAR) data points [2]. The HTS data from pharma companies are usually very sparse, and for each screened target there is only a small number of active hits. Further development is carried out with relatively small series of compounds, usually ranging from hundreds to thousands of compounds per series. Specialists who work on these target-specific data do not encounter Big Data in their daily work; traditional modeling algorithms are sufficient to handle their datasets.
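
To make the notion of sparsity concrete, the fill fraction of a compound-by-target screening matrix can be computed as below. This is a hedged sketch: only the figure of 150 million data points comes from the text above, while the compound and target counts are hypothetical placeholders, as is the function name.

```python
def matrix_density(n_measured: int, n_compounds: int, n_targets: int) -> float:
    """Fraction of compound-target pairs that have a measured value."""
    if n_compounds <= 0 or n_targets <= 0:
        raise ValueError("n_compounds and n_targets must be positive")
    return n_measured / (n_compounds * n_targets)


N_MEASURED = 150_000_000   # SAR data points, the figure quoted in the text [2]
N_COMPOUNDS = 2_000_000    # assumed collection size (hypothetical)
N_TARGETS = 1_000          # assumed number of screened targets (hypothetical)

density = matrix_density(N_MEASURED, N_COMPOUNDS, N_TARGETS)
print(f"filled: {density:.1%}, empty: {1 - density:.1%}")
```

With these assumed dimensions, more than nine out of ten compound-target pairs would have no measurement at all, and the active hits would form only a small subset of the measured pairs.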
When the focus is on chemogenomics data, the situation is different. The biggest medicinal chemistry data reservoir, PubChem, currently comprises 91 million chemical structures and 230 million bioactivity data points corresponding to over
“...further progress will critically depend on training programs and advances in chemoinformatics, a discipline bridging chemistry and informatics.”
SPECIAL FOCUS: Computational chemistry & computer-aided drug discovery – Part II