Computational Biology and Chemistry 59 (2015) 1–7

journal homepage: www.elsevier.com/locate/compbiolchem

Research Article

Machine Learnable Fold Space Representation based on Residue Cluster Classes

Ricardo Corral-Corral a, Edgar Chavez b, Gabriel Del Rio a,*

a Department of Biochemistry and Structural Biology, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, México D. F., México
b Centro de Investigación Científica y de Educación Superior de Ensenada, México

Article info

Article history: Received 12 November 2014; Received in revised form 17 July 2015; Accepted 25 July 2015; Available online 30 July 2015

Abstract

Motivation: Protein fold space is a conceptual framework where all possible protein folds exist and ideas about protein structure, function and evolution may be analyzed. Classification of protein folds in this space is commonly achieved by using similarity indexes and/or machine learning approaches, each with different limitations.

Results: We propose a method for constructing a compact vector space model of protein fold space by representing each protein structure by its residues' local contacts. We developed an efficient method to statistically test for the separability of points in a space and showed that our protein fold space representation is learnable by any machine-learning algorithm.

Availability: An API is freely available at https://code.google.com/p/pyrcc/.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

All possible protein folds are assumed to occupy an abstract space referred to as fold space. This fold space has become a conceptual framework to unify ideas about protein structures with protein function and protein evolution (Cheng and Brooks, 2013). For instance, it is debated whether this space is discrete or continuous (Kolodny et al., 2006; Skolnick et al., 2009; Sadreyev et al., 2009).
Relevant to our study is the common use of protein similarity measures (e.g., root mean square deviation, RMSD) to infer the proximity of proteins in this space (Minary and Levitt, 2008). In this case, the inference derived from such measurements assumes that the proximity of protein folds is the only relevant property to explain protein fold evolution and function.

Rather than focusing only on the proximity of protein folds, other studies have used vector space models to expand the protein fold space representation. In such a model, each protein structure is represented by a point (position vector) in a space of fixed dimension (e.g., Euclidean space); the position vectors may be derived by adjusting them until their relative distances approximate a protein similarity measure. For example, using the scores of a sequential structure alignment program between each pair of structures as the protein similarity measure (Orengo and Taylor, 1996), followed by multidimensional scaling, allowed the assignment of position vectors representing each protein structure (Michie et al., 1996).

* Corresponding author. Tel.: +44 000 0000000; fax: +44 000 0000000. E-mail address: gdelrio@ifc.unam.mx (G. Del Rio).

Using DALI as the similarity measure, Holm and Sander (1998) showed two-dimensional projections to explore protein neighbors in fold space. DALI has also been used as the similarity measure to visualize class distribution and fold usage between two bacterial species (Hou et al., 2003) and to explore protein function assignment based on position in this fold space representation (Hou et al., 2005).

Fold space constructions based on protein structure similarities have two important limitations. First, the time needed to run a structural comparison for each pair of structures is restrictive: 25,000 central processing unit hours were needed to calculate the similarities between 1898 protein structures (Hou et al., 2005).
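The figure quoted for this first limitation can be unpacked with a little arithmetic. The sketch below assumes that every unordered pair of structures was compared exactly once, which the text does not state explicitly; under that assumption, 1898 structures imply roughly 1.8 million comparisons at about 50 CPU seconds each. The function names and the tenfold extrapolation are hypothetical and only illustrate how the cost grows quadratically with the number of structures.

```python
def n_pairs(n: int) -> int:
    """Number of unordered pairs among n structures (all-vs-all, no self-comparisons)."""
    return n * (n - 1) // 2


def seconds_per_pair(n: int, cpu_hours: float) -> float:
    """Average CPU seconds per comparison, assuming each unordered pair is compared once."""
    return cpu_hours * 3600.0 / n_pairs(n)


# Figures quoted in the text (Hou et al., 2005).
N_STRUCTURES = 1898
CPU_HOURS = 25_000

pairs = n_pairs(N_STRUCTURES)
avg_seconds = seconds_per_pair(N_STRUCTURES, CPU_HOURS)
print(f"{pairs} pairwise comparisons, about {avg_seconds:.0f} CPU seconds each")

# Hypothetical extrapolation: same average cost per pair, ten times as many structures.
bigger_cpu_hours = n_pairs(10 * N_STRUCTURES) * avg_seconds / 3600.0
print(f"10x structures -> roughly {bigger_cpu_hours:.0f} CPU hours")
```

Because the number of pairs grows with the square of the number of structures, a tenfold larger data set costs about a hundred times more under the same assumptions.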
Second, the position of each point depends heavily on the set of structures being analyzed, so the inclusion of a single new structure can displace all previously assigned positions; thus, there are as many fold space representations as there are protein structure data sets, even when a single similarity score is adopted.

http://dx.doi.org/10.1016/j.compbiolchem.2015.07.010 (ISSN 1476-9271)

An alternative fold space representation may be built by assigning the position of each vector using only the structural features of the protein it represents. In this way, a new structure can find its location in this fold space without altering the existing ones. One implementation of such a vector space model is FragBag (Budowski-Tal et al., 2010), which represents protein structures in 400 dimensions; here, each component of the vector representing a protein fold corresponds to the number of occurrences of a