DOI: 10.1002/minf.201800104

Bimolecular Nucleophilic Substitution Reactions: Predictive Models for Rate Constants and Molecular Reaction Pairs Analysis

Timur Gimadiev,[a, b] Timur Madzhidov,[a] Igor Tetko,[c] Ramil Nugmanov,[a] Iury Casciuc,[b] Olga Klimchuk,[b] Andrey Bodrov,[a] Pavel Polishchuk,[d] Igor Antipin,[a] and Alexandre Varnek*[b]

Abstract: Here, we report the data visualization, analysis and modeling of a large set of 4830 SN2 reactions whose rate constants (logk) were measured under different experimental conditions (solvent, temperature). Each reaction was encoded as a single molecular graph, the Condensed Graph of Reaction, which allowed us to use conventional chemoinformatics techniques developed for individual molecules. On this basis, the Matched Reaction Pairs approach was proposed and used to analyze substituent effects on the reactivity of substrates and nucleophiles. The data were visualized using the Generative Topographic Mapping approach. A consensus Support Vector Regression (SVR) model for the rate constant was built. An unbiased estimate of the model's performance was obtained in cross-validation on reactions measured on unique structural transformations. The model's performance in cross-validation (RMSE = 0.61 logk units) and on the external test set (RMSE = 0.80) is close to the noise in the data. The performance of local models obtained for selected subsets of reactions, proceeding in particular solvents or with particular types of nucleophiles, was similar to that of the model built on the entire set. Finally, four different definitions of the model's applicability domain for reactions were examined.
Keywords: bimolecular nucleophilic substitution reactions · Condensed Graph of Reaction · Matched Reaction Pairs · Support Vector Regression · Generative Topographic Mapping · models applicability domain

[a] T. Gimadiev, T. Madzhidov, R. Nugmanov, A. Bodrov, I. Antipin
Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya str. 18, Kazan, Russia
[b] T. Gimadiev, I. Casciuc, O. Klimchuk, A. Varnek
Laboratoire de Chémoinformatique, UMR 7140 CNRS, Université de Strasbourg, 1, rue Blaise Pascal, 67000 Strasbourg, France
E-mail: varnek@unistra.fr
[c] I. Tetko
Helmholtz Zentrum München – German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany
[d] P. Polishchuk
Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacky University, Hněvotínská 1333/5, 77900, Olomouc, Czech Republic

Supporting information for this article is available on the WWW under https://doi.org/10.1002/minf.201800104

Full Paper, www.molinf.com. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim. Mol. Inf. 2018, 37, 1800104

1 Introduction

Compared to an individual molecule, a chemical reaction is a complex object because it involves several molecular species of two types (reactants and products) and its yield depends on experimental conditions (solvent, catalyst, temperature). This prevents most conventional methods designed for the analysis and modeling of individual compounds from being applied to chemical reactions. This complexity can be reduced using the Condensed Graph of Reaction (CGR) approach,[1] which represents a reaction by a single 2D graph, a kind of pseudomolecule, characterized by both conventional chemical bonds and so-called dynamic bonds that describe chemical transformations. Fragment descriptors generated for a CGR can be used in any chemoinformatics application that takes a descriptor vector as input. This approach has been used successfully for similarity searching in reaction space,[2] for data analysis using Generative Topographic Mapping,[3] and for QSPR modeling of various kinetic and thermodynamic properties of reactions[4–6] or of optimal reaction conditions.[7]

On the other hand, some data analysis methods that treat chemical species as molecular graphs have not yet been used for the analysis of reactions. One of these methods, Matched Molecular Pairs, is widely used in medicinal chemistry[8] to analyze the effect of replacing one chemical group with another. In this paper, we demonstrate how this approach can be extended to chemical reactions represented by their CGRs.

Another goal of this paper is the development of predictive models for the logarithm of the rate constant (logk) of bimolecular nucleophilic substitution (SN2) reactions. Nucleophilic substitution (SN) is a fundamental class of reactions in