DOI: 10.1002/minf.201800104
Bimolecular Nucleophilic Substitution Reactions: Predictive Models for Rate Constants and Molecular Reaction Pairs Analysis
Timur Gimadiev,[a, b] Timur Madzhidov,[a] Igor Tetko,[c] Ramil Nugmanov,[a] Iury Casciuc,[b] Olga Klimchuk,[b] Andrey Bodrov,[a] Pavel Polishchuk,[d] Igor Antipin,[a] and Alexandre Varnek*[b]
Abstract: Here, we report the data visualization, analysis and modeling for a large set of 4830 SN2 reactions, the rate constants of which (logk) were measured under different experimental conditions (solvent, temperature). The reactions were encoded by a single molecular graph, the Condensed Graph of Reaction, which allowed us to use conventional chemoinformatics techniques developed for individual molecules. Thus, the Matched Reaction Pairs approach was suggested and used for the analysis of substituent effects on the reactivity of substrates and nucleophiles. The data were visualized with the help of the Generative Topographic Mapping approach. A consensus Support Vector Regression (SVR) model for the rate constant was prepared. An unbiased estimation of the model’s performance was made in cross-validation on reactions measured on unique structural transformations. The model’s performance in cross-validation (RMSE = 0.61 logk units) and on the external test set (RMSE = 0.80) is close to the noise in data. Performances of the local models obtained for selected subsets of reactions proceeding in particular solvents or with particular types of nucleophiles were similar to that of the model built on the entire set. Finally, four different definitions of the model’s applicability domain for reactions were examined.
Keywords: bimolecular nucleophilic substitution reactions · Condensed Graph of Reaction · Matched Reaction Pairs · Support Vector Regression · Generative Topographic Mapping · models applicability domain
1 Introduction
Compared to an individual molecule, a chemical reaction is a complex object because it involves several molecular species of two types (reactants and products) and its yield depends on experimental conditions (solvent, catalyst, temperature). This prevents applying to chemical reactions most of the conventional methods designed for the analysis and modeling of individual compounds. This complexity can be reduced using the Condensed Graph of Reaction (CGR) approach,[1] which represents a reaction by a single 2D graph, a sort of pseudomolecule, characterized by both conventional chemical bonds and so-called dynamic bonds describing chemical transformations. Fragment descriptors generated for a CGR can be applied in any chemoinformatics application that uses a descriptor vector as input. This approach has successfully been used for similarity searching in reaction space,[2] for data analysis using Generative Topographic Mapping,[3] and for QSPR modeling of various kinetic and thermodynamic properties of reactions[4–6] or optimal reaction conditions.[7]
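The idea of a CGR can be illustrated with a minimal sketch. It is not the authors' implementation: the bond-table representation, the function names `condensed_graph` and `dynamic_bonds`, and the toy atom numbering are all assumptions made for illustration. Each bond between two atom-mapped atoms is labeled with its order in the reactants and in the products; bonds whose two orders differ play the role of dynamic bonds.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical representation: a bond is an unordered pair of atom-map
# numbers; a bond table maps each bond to its integer bond order.
Bond = FrozenSet[int]
BondTable = Dict[Bond, int]


def condensed_graph(reactants: BondTable,
                    products: BondTable) -> Dict[Bond, Tuple[int, int]]:
    """Merge two bond tables into one graph whose edges carry
    (order in reactants, order in products); 0 means 'no bond'."""
    cgr = {}
    for bond in set(reactants) | set(products):
        cgr[bond] = (reactants.get(bond, 0), products.get(bond, 0))
    return cgr


def dynamic_bonds(cgr: Dict[Bond, Tuple[int, int]]) -> Dict[Bond, Tuple[int, int]]:
    """Keep only the edges whose order changes during the reaction."""
    return {bond: orders for bond, orders in cgr.items()
            if orders[0] != orders[1]}


# Toy SN2 example (hydrogens implicit): ethyl bromide + hydroxide
# -> ethanol + bromide. Assumed atom maps: 1 = CH3 carbon,
# 2 = CH2 carbon, 3 = Br, 4 = O.
reactants = {frozenset({1, 2}): 1, frozenset({2, 3}): 1}
products = {frozenset({1, 2}): 1, frozenset({2, 4}): 1}

cgr = condensed_graph(reactants, products)
dyn = dynamic_bonds(cgr)

for bond, (before, after) in sorted(dyn.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(bond), f"order {before} -> {after}")
```

In this toy case the C–C bond is a conventional bond (order unchanged), while the C–Br bond (broken) and the C–O bond (formed) are the two dynamic bonds. Fragment descriptors would then be counted over such labeled edges in the same way as over ordinary molecular graphs.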
On the other hand, some methods of data analysis that treat chemical species as molecular graphs have not so far been applied to reactions. One of these methods, Matched Molecular Pairs, is widely used in medicinal chemistry[8] for the analysis of the effects of replacing one chemical group with another. In this paper, we demonstrate how this approach can be extended to chemical reactions represented by their CGRs.
Another goal of this paper is the development of predictive models for the logarithm of the rate constant (logk) of bimolecular nucleophilic substitution (SN2) reactions. Nucleophilic substitution (SN) is a fundamental class of reactions in
[a] T. Gimadiev, T. Madzhidov, R. Nugmanov, A. Bodrov, I. Antipin
Laboratory of Chemoinformatics and Molecular Modeling
Butlerov Institute of Chemistry
Kazan Federal University
Kremlyovskaya str. 18, Kazan, Russia
[b] T. Gimadiev, I. Casciuc, O. Klimchuk, A. Varnek
Laboratoire de Chémoinformatique, UMR 7140 CNRS
Université de Strasbourg
1, rue Blaise Pascal, 67000 Strasbourg, France
E-mail: varnek@unistra.fr
[c] I. Tetko
Helmholtz Zentrum München – German Research Center for
Environmental Health (GmbH)
Institute of Structural Biology
Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany
[d] P. Polishchuk
Institute of Molecular and Translational Medicine
Faculty of Medicine and Dentistry
Palacky University
Hněvotínská 1333/5, 77900, Olomouc, Czech Republic.
Supporting information for this article is available on the WWW
under https://doi.org/10.1002/minf.201800104
Full Paper www.molinf.com
© 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Mol. Inf. 2018, 37, 1800104 (1 of 15) 1800104