Computational Biology and Chemistry 59 (2015) 1–7
journal homepage: www.elsevier.com/locate/compbiolchem
Research Article
Machine Learnable Fold Space Representation based on Residue
Cluster Classes
Ricardo Corral-Corral a, Edgar Chavez b, Gabriel Del Rio a,*
a Department of Biochemistry and Structural Biology, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, México D.F., México
b Centro de Investigación Científica y de Educación Superior de Ensenada, México
Article info
Article history:
Received 12 November 2014
Received in revised form 17 July 2015
Accepted 25 July 2015
Available online 30 July 2015
Abstract
Motivation: Protein fold space is a conceptual framework where all possible protein folds exist and ideas
about protein structure, function and evolution may be analyzed. Classification of protein folds in this
space is commonly achieved by using similarity indexes and/or machine learning approaches, each with
different limitations.
Results: We propose a method for constructing a compact vector space model of protein fold space by
representing each protein structure by its residues' local contacts. We developed an efficient method
to statistically test for the separability of points in a space and showed that our protein fold space
representation is learnable by any machine-learning algorithm.
Availability: An API is freely available at https://code.google.com/p/pyrcc/.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
All possible protein folds are assumed to occupy an abstract
space referred to as fold space. This fold space has become a con-
ceptual framework to unify ideas about protein structures with
protein function and protein evolution (Cheng and Brooks, 2013).
For instance, it is debated whether this space is discrete or contin-
uous (Kolodny et al., 2006; Skolnick et al., 2009; Sadreyev et al.,
2009). Relevant to our study is the common use of protein similarity
measures (e.g., root mean square deviation, RMSD) to infer the
proximity of protein structures in this space (Minary and Levitt, 2008).
Inferences derived from such measurements assume that the proximity
of protein folds is the only property relevant to explaining protein
fold evolution and function.
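As a point of reference for the similarity measures mentioned above, the following minimal sketch computes the RMSD between two sets of atomic coordinates. It assumes that the two structures have the same number of atoms, that atoms are already paired one to one, and that the structures are already superimposed; no optimal superposition is performed. The function name `rmsd` and the toy coordinates are illustrative only and do not come from the original work.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired, pre-superimposed
    coordinate lists, each a sequence of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    # Sum of squared Euclidean distances between paired atoms
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example (hypothetical coordinates): every atom is shifted by the
# vector (0, 3, 4), whose length is 5, so the RMSD is 5.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_b = [(x, y + 3.0, z + 4.0) for (x, y, z) in toy_a]
print(rmsd(toy_a, toy_b))  # 5.0
```

A single number of this kind only orders structures by closeness, which is the limitation the text points to.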
* Corresponding author. Tel.: +44 000 0000000; fax: +44 000 0000000
E-mail address: gdelrio@ifc.unam.mx (G. Del Rio).
Instead of focusing only on the proximity of protein folds, vector
space models have been used to expand the protein fold space
representation. In such a model, each protein structure is represented
by a point (position vector) in a space of fixed dimension (e.g.,
Euclidean space). These position vectors may be derived by adjusting
their positions so that their relative distances approximate a protein
similarity measure. For example, using the scores of a sequential
structure alignment program between each pair of structures as the
similarity measure (Orengo and Taylor, 1996), followed by
multidimensional scaling, allowed the assignment of position vectors
representing each protein structure (Michie et al., 1996). Using DALI
as the similarity measure, Holm and Sander (1998) showed
two-dimensional projections to explore protein neighbors in fold space.
DALI has also been used as the similarity measure to visualize class
distribution and fold usage between two bacterial species (Hou et al.,
2003) and to explore protein function assignment based on position in
this fold space representation (Hou et al., 2005).
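The embedding step described above can be sketched with classical (Torgerson) multidimensional scaling. This is one common variant, chosen here for brevity; the cited studies may have used a different formulation. The function name `classical_mds` and the toy distance matrix are assumptions made for illustration. In practice the matrix would hold dissimilarities derived from pairwise structure alignment scores.

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Embed n items in n_dims dimensions from an n x n distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues of a symmetric matrix in ascending order
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    # Clip tiny negative eigenvalues caused by rounding
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Toy stand-in for structural dissimilarities: distances among four
# hypothetical "structures" placed at the corners of a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

embedding = classical_mds(toy_dist)
recovered = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
print(np.allclose(recovered, toy_dist))  # True
```

The recovered coordinates match the original points only up to rotation, reflection and translation, because the embedding is determined by the pairwise distances alone.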
Fold space constructions based on protein structure similarities
have two important limitations. First, the time needed to run a
structural comparison for each pair of structures is restrictive:
25,000 central processing unit hours were needed to calculate
similarities between 1898 protein structures (Hou et al., 2005).
Second, the position of each point depends heavily on the set of
structures being analyzed, so the inclusion of a single new structure
can displace all previously assigned positions; thus, there are as
many fold space representations as there are protein structure data
sets, even when a single similarity score is adopted.
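To make the first limitation concrete, the back-of-the-envelope calculation below counts the pairwise comparisons implied by the figures quoted above. It assumes that every unordered pair of structures is compared exactly once and that the reported CPU time was spent entirely on those comparisons; neither assumption is stated in the source, so the per-pair time is only a rough estimate. The helper name `n_pairs` is illustrative.

```python
def n_pairs(n_structures):
    """Number of unordered pairs among n structures: n * (n - 1) / 2."""
    return n_structures * (n_structures - 1) // 2

n_structures = 1898   # structures compared (Hou et al., 2005)
cpu_hours = 25_000    # reported CPU hours for all comparisons

pairs = n_pairs(n_structures)
seconds_per_pair = cpu_hours * 3600 / pairs

# Each additional structure must be compared against all existing ones
extra = n_pairs(n_structures + 1) - pairs

print(pairs)                       # 1800253
print(round(seconds_per_pair, 1))  # 50.0
print(extra)                       # 1898
```

Under these assumptions each comparison costs roughly 50 CPU seconds, and the total number of comparisons grows quadratically with the number of structures.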
An alternative fold space representation may be built by assigning
the position of each vector using only the structural features of
the protein it represents. In this way, a new structure can find its
location in this fold space without altering the existing ones. One
implementation of such a vector space model is FragBag
(Budowski-Tal et al., 2010), which represents protein structures in
400 dimensions; here, each component in the vector representing
a protein fold corresponds to the number of occurrences of a
http://dx.doi.org/10.1016/j.compbiolchem.2015.07.010
1476-9271/© 2015 Elsevier Ltd. All rights reserved.