Machine learning of chemical reactivity from databases of organic reactions

Gonçalo V. S. M. Carrera · Sunil Gupta · João Aires-de-Sousa

Received: 27 November 2008 / Accepted: 18 April 2009 / Published online: 26 May 2009
© Springer Science+Business Media B.V. 2009

Abstract Databases of chemical reactions contain knowledge about the reactivity of specific reagents. Although information is in general only explicitly available for compounds reported to react, it is possible to derive information about substructures that do not react in the reported reactions. Both types of information (positive and negative) can be used to train machine learning techniques to predict if a compound reacts or not with a specific reagent. The whole process was implemented with two databases of reactions, one involving BuNH2 as the reagent, and the other NaCNBH3. Negative information was derived using MOLMAP molecular descriptors, and classification models were developed with Random Forests also based on MOLMAP descriptors. MOLMAP descriptors were based exclusively on calculated physicochemical features of molecules. Correct predictions were achieved for ~90% of independent test sets. While NaCNBH3 is a selective reducing reagent widely used in organic synthesis, BuNH2 is a nucleophile that mimics the reactivity of the lysine side chain (involved in an initiating step of the mechanism leading to skin sensitization).
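The classification step summarized in the abstract (an ensemble of trees trained on descriptor vectors labelled "reacts" or "does not react") can be illustrated with a minimal sketch. This is not the authors' pipeline: the descriptors below are random placeholder numbers, not MOLMAP values, the labelling rule is hypothetical, and the ensemble is a set of bagged single-split trees (decision stumps), a simplified stand-in for Random Forests, which additionally grow full trees and sample random feature subsets at each split. All function names are illustrative. The sketch also shows how an out-of-bag (OOB) accuracy estimate is obtained from samples left out of each bootstrap.

```python
import random


def make_toy_data(n, n_features, rng):
    """Build placeholder descriptor vectors (random numbers, not real MOLMAP values)."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    # Hypothetical labelling rule: "reacts" (1) when feature 0 exceeds 0.5.
    y = [1 if row[0] > 0.5 else 0 for row in X]
    return X, y


def predict_stump(stump, row):
    """Classify one descriptor vector with a single-split tree."""
    feature, threshold, label_above = stump
    return label_above if row[feature] > threshold else 1 - label_above


def fit_stump(X, y, sample):
    """Return the (feature, threshold, label_above) split with fewest errors on `sample`."""
    best_errors, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({X[i][feature] for i in sample}):
            for label_above in (0, 1):
                stump = (feature, threshold, label_above)
                errors = sum(predict_stump(stump, X[i]) != y[i] for i in sample)
                if best_errors is None or errors < best_errors:
                    best_errors, best_stump = errors, stump
    return best_stump


def bagged_oob_accuracy(X, y, n_stumps, rng):
    """Train bagged stumps; score each sample only with stumps that never saw it."""
    n = len(X)
    votes = [[0, 0] for _ in range(n)]  # out-of-bag votes for class 0 and class 1
    stumps = []
    for _ in range(n_stumps):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap: n draws with replacement
        stump = fit_stump(X, y, sample)
        stumps.append(stump)
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_stump(stump, X[i])] += 1
    scored = [i for i in range(n) if sum(votes[i]) > 0]
    # Majority vote over out-of-bag stumps; ties default to class 0.
    correct = sum((1 if votes[i][1] > votes[i][0] else 0) == y[i] for i in scored)
    return stumps, correct / len(scored)


rng = random.Random(0)
X, y = make_toy_data(80, 5, rng)
stumps, oob = bagged_oob_accuracy(X, y, 15, rng)
print(f"OOB accuracy on the toy data: {oob:.2f}")
```

Because every sample is scored only by the stumps whose bootstrap did not contain it, the OOB figure behaves like an internal validation estimate and requires no separate hold-out set. In practice a library implementation of Random Forests would replace the hand-written ensemble.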
Keywords MOLMAP · Chemical reactivity · Databases · Machine learning · Electrophilicity

Abbreviations
MOLMAP  MOLecular maps of atom-level properties
BuNH2   Butylamine
RF      Random forest
VOC     Volatile organic compounds
QSAR    Quantitative structure-activity relationship
OOB     Out of bag
SVM     Support vector machines
ROC     Receiver operating characteristic
SOM     Self-organizing maps
HTS     High-throughput screening

Electronic supplementary material The online version of this article (doi:10.1007/s10822-009-9275-2) contains supplementary material, which is available to authorized users.

G. V. S. M. Carrera · S. Gupta · J. Aires-de-Sousa (✉)
REQUIMTE, CQFB, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
e-mail: jas@fct.unl.pt

J Comput Aided Mol Des (2009) 23:419–429
DOI 10.1007/s10822-009-9275-2

Introduction

Chemoinformatics approaches that learn from available experimental data to make rapid estimations of chemical reactivity are currently sought for applications in several fields. Chemical reactivity is involved in the toxicological mechanisms responsible for skin sensitization [1], mutagenicity [2], or adverse side effects of drugs [3]. Prediction of reactivity is needed in pharmaceutical R&D innovation processes, and for the prioritization of experimental tests in risk assessment of chemicals, notably in relation to the EU REACH legislation [4]. Furthermore, the legislative trend toward the abolition of animal testing of cosmetic products [5] demands alternative evaluation procedures [6, 11]. For the assessment of skin sensitization, in vitro reactivity tests [7–10] as well as QSARs have been proposed. In silico methodologies are also of interest for "Integrated Testing Strategies" that combine different types of data and information, e.g. predictions or results obtained from several single tests, in the decision-making process [10]. In the area of eco-toxicology,