RESEARCH ARTICLE Open Access

Is EC class predictable from reaction mechanism?

Neetika Nath and John BO Mitchell*

Abstract

Background: We investigate the relationships between the EC (Enzyme Commission) class, the associated chemical reaction, and the reaction mechanism by building predictive models using Support Vector Machine (SVM), Random Forest (RF) and k-Nearest Neighbours (kNN). We consider two ways of encoding the reaction mechanism in descriptors, and also three approaches that encode only the overall chemical reaction. Both cross-validation and also an external test set are used.

Results: The three descriptor sets encoding overall chemical transformation perform better than the two descriptions of mechanism. SVM and RF models perform comparably well; kNN is less successful. Oxidoreductases and hydrolases are relatively well predicted by all types of descriptor; isomerases are well predicted by overall reaction descriptors but not by mechanistic ones.

Conclusions: Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. Oxidoreductases, hydrolases, and to some extent isomerases and ligases, have clear chemical signatures, making them easier to predict than transferases and lyases. We find evidence that isomerases as a class are notably mechanistically diverse and that their one shared property, of substrate and product being isomers, can arise in various unrelated ways. The performance of the different machine learning algorithms is in line with many cheminformatics applications, with SVM and RF being roughly equally effective. kNN is less successful, given the role that non-local information plays in successful classification. We note also that, despite a lack of clarity in the literature, EC number prediction is not a single problem; the challenge of predicting protein function from available sequence data is quite different from assigning an EC classification from a cheminformatics representation of a reaction.
Background

Encoding enzyme reactions and mechanisms

Almost all biological processes proceed at a significant rate only because of enzymes, proteins that catalyse the chemical reactions found in nature. For half a century, enzymes have been annotated using Enzyme Commission (EC) numbers [1]. The scheme is a hierarchical organization of enzyme reactions into six main classes (oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases), which are then split at a further three hierarchical levels. In general, these successive levels describe the reaction at increasingly fine levels of granularity. The six top level classes are very broad reaction types. The second level subclass and third level sub-subclass usually describe the specific bonds or functional groups involved in the reaction. The fourth level serial number defines the actual substrate and therefore the specific chemical reaction catalysed.

The EC classification can be conveniently browsed and searched via the ExplorEnz database [2,3], while the official website maintained by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) [4] is a valuable and regularly updated resource. Numerous other online databases allow the user to explore enzyme structure and function, including the Enzyme Structures Database [5], IntEnz [6], BRENDA [7] and KEGG [8,9].

Our motivation is to investigate the relationship between the reaction mechanism as described in the MACiE [10-13] (Mechanism, Annotation and Classification in Enzymes) database and the main top-level class of the EC classification. In order to do this, we generate supervised machine learning models to predict EC class from data on the chemical reaction or its mechanism.
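The four-level structure of an EC number described in this section can be illustrated with a short sketch. The Python fragment below is only an illustration and is not part of the method used in this study: the names `ECNumber`, `parse_ec` and `class_name` are hypothetical, and the code assumes a fully specified four-field number such as 3.4.21.4, with the six top-level classes numbered 1 to 6 in the order listed in the text.

```python
from typing import NamedTuple

# The six top-level EC classes named in the text, keyed by their
# first-level number (assumption: only these six classes are considered).
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


class ECNumber(NamedTuple):
    """Hypothetical container for the four hierarchical levels."""
    ec_class: int      # first level: broad reaction type
    subclass: int      # second level: bonds or groups involved
    sub_subclass: int  # third level: finer description of the same
    serial: int        # fourth level: specific substrate and reaction


def parse_ec(text: str) -> ECNumber:
    """Parse a fully specified EC number such as 'EC 3.4.21.4'."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    parts = cleaned.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {len(parts)}: {text!r}")
    # int() raises ValueError for partial numbers such as '3.4.-.-'
    ec = ECNumber(*(int(p) for p in parts))
    if ec.ec_class not in EC_CLASSES:
        raise ValueError(f"unknown top-level class: {ec.ec_class}")
    return ec


def class_name(ec: ECNumber) -> str:
    """Return the top-level class name, the quantity predicted in this work."""
    return EC_CLASSES[ec.ec_class]


if __name__ == "__main__":
    ec = parse_ec("EC 3.4.21.4")
    print(ec, class_name(ec))
```

Only the first field, `ec_class`, corresponds to the prediction target considered here; the remaining three fields are kept to show how the hierarchy narrows from a broad reaction type down to a specific reaction.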
* Correspondence: jbom@st-andrews.ac.uk
Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland KY16 9ST, UK

Nath and Mitchell BMC Bioinformatics 2012, 13:60
http://www.biomedcentral.com/1471-2105/13/60

© 2012 Nath and Mitchell; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.