J. Biomedical Science and Engineering, 2008, 1, 133-140
Published Online August 2008 in SciRes. http://www.srpublishing.org/journal/jbise JBiSE
Visualization of protein structure relationships using constrained twin kernel embedding
Yi Guo¹, Jun-Bin Gao², Paul Wing Hing Kwan¹ & Kevin Xinsheng Hou³
¹ School of Science and Technology, University of New England, Armidale, NSW 2351, Australia.
² School of Computer Science, Charles Sturt University, Bathurst, NSW 2795, Australia.
³ School of Biological, Biomedical and Molecular Sciences, University of New England, Armidale, NSW 2351, Australia.
Correspondence should be addressed to Yi Guo (yguo4@turing.une.edu.au), Jun-Bin Gao (jbgao@csu.edu.au), Paul W. Kwan (kwan@turing.une.edu.au), and
Kevin X. Hou (xhou@une.edu.au).
ABSTRACT
In this paper, a recently proposed dimensionality reduction method called Twin Kernel Embedding (TKE) [10] is applied to the 2-dimensional visualization of protein structure relationships. By matching the similarity measures of the input and the embedding spaces, expressed by their respective kernels, TKE ensures that both local and global proximity information is preserved simultaneously. Experiments conducted on a subset of the Structural Classification of Proteins (SCOP) database confirmed the effectiveness of TKE in preserving the original relationships among protein structures in the lower dimensional embedding according to their similarities. This result is expected to benefit subsequent analyses of protein structures and their functions.
1. INTRODUCTION
Recent years have seen rapid advances in dimensionality reduction (DR) methods, which are widely applied in bioinformatics [26, 14, 6, 16], biometrics [19, 4, 15, 17], robotics [11], etc. The rapid growth of DR methods stems from the need to reduce the complexity of the problem at hand. These methods mainly aim to find counterparts of the input data in a much lower dimensional space without incurring significant information loss. The low dimensional representation can then be used in subsequent procedures such as classification and pattern recognition. For example, a medium-sized protein structure typically has a few thousand degrees of freedom and therefore naturally resides in a very high dimensional space. This gives rise to the so-called “curse of dimensionality”, which not only drastically increases the computational complexity of learning algorithms but also requires large storage space, leading to very slow indexing and searching in a large-scale database. Through DR, most of the redundant dimensions can be removed, and the remaining degrees of freedom are then passed to the subsequent discrimination and classification tasks, which are considerably simplified.
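The idea of removing redundant dimensions can be illustrated with a minimal sketch. It is not the method of this paper: it uses plain PCA (computed through an SVD) as a stand-in DR method, and the data are synthetic, generated from two hypothetical latent factors that are linearly mixed into 50 observed dimensions with a small amount of noise. The function name `pca` and all sizes are assumptions chosen for illustration only.

```python
import numpy as np


def pca(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the low-dimensional coordinates and the fraction of total
    variance captured by each retained component.
    """
    Xc = X - X.mean(axis=0)  # centre each feature
    # Rows of Vt are the principal directions; S holds singular values.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # variance fraction per component
    Z = Xc @ Vt[:n_components].T  # coordinates in the reduced space
    return Z, explained[:n_components]


# Synthetic data (assumption): 200 samples with only 2 real degrees of
# freedom, observed in 50 dimensions with small additive noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

Z, explained = pca(X, n_components=2)
print(Z.shape)          # (200, 2)
print(explained.sum())  # fraction of variance kept by 2 components
```

In this constructed setting the two retained components carry almost all of the variance, so the other 48 dimensions are redundant. Real protein structure data need not be so close to a linear subspace.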
Another advantage of DR is that 2- or 3-dimensional mappings of the original data can be visualized in a Euclidean space, which helps researchers interpret the relationships among the data. The relationships among a set of protein structures are typically represented as trees derived by hierarchical clustering. However, this representation only provides some hints on the evolutionary distances between protein structures. This limitation motivates our application of DR methods to visualizing the similarity relationships among protein structures.
In this paper, we propose a new DR algorithm called Constrained Twin Kernel Embedding (CTKE), based on Twin Kernel Embedding (TKE) [10], to achieve the above target. To provide the necessary background, we first give a brief review of DR methods applied to protein structures in the next section, followed by an introduction to TKE and the related topics involved in the new algorithm. We then introduce the CTKE algorithm, which integrates the similarity matching of TKE with the objective function used in LS-SVM [23]. Experiments on real protein structure data are presented next, and finally we summarize the paper with a conclusion.
2. THE RELATED WORKS
DR methods can be categorized into linear DR (LDR) methods, such as Principal Component Analysis (PCA) [13] and Linear Discriminant Analysis (LDA) [7], and nonlinear DR (NLDR) methods, such as ISOMAP [25], Laplacian Eigenmaps (LE) [3], Locally Linear Embedding (LLE) [20], etc. LDR methods have been widely used in bioinformatics due to their simplicity. For example, Teodoro et al. [27] applied PCA to transform the original high dimensional protein motion data into a lower dimensional representation that captures the dominant modes of motion of the protein. However, the linearity assumption on which linear methods are built does not hold in most cases. In [5], Das et al.