J. Biomedical Science and Engineering, 2008, 1, 133-140
Published Online August 2008 in SciRes. http://www.srpublishing.org/journal/jbise JBiSE

Visualization of protein structure relationships using constrained twin kernel embedding

Yi Guo 1, Jun-Bin Gao 2, Paul Wing Hing Kwan 1 & Kevin Xinsheng Hou 3

1 School of Science and Technology, University of New England, Armidale, NSW 2351, Australia.
2 School of Computer Science, Charles Sturt University, Bathurst, NSW 2795, Australia.
3 School of Biological, Biomedical and Molecular Sciences, University of New England, Armidale, NSW 2351, Australia.
Correspondence should be addressed to Yi Guo (yguo4@turing.une.edu.au), Jun-Bin Gao (jbgao@csu.edu.au), Paul W. Kwan (kwan@turing.une.edu.au), and Kevin X. Hou (xhou@une.edu.au).

ABSTRACT

In this paper, a recently proposed dimensionality reduction method called Twin Kernel Embedding (TKE) [10] is applied to the 2-dimensional visualization of protein structure relationships. By matching the similarity measures of the input and the embedding spaces, expressed by their respective kernels, TKE ensures that both local and global proximity information are preserved simultaneously. Experiments conducted on a subset of the Structural Classification of Proteins (SCOP) database confirmed the effectiveness of TKE in preserving, in the lower dimensional embedding, the original relationships among protein structures according to their similarities. This result is expected to benefit subsequent analyses of protein structures and their functions.

1. INTRODUCTION

Recent years have seen phenomenal advances in dimensionality reduction (DR) methods, which are widely applied in bioinformatics [26, 14, 6, 16], biometrics [19, 4, 15, 17], robotics [11], etc. The rapid growth of DR methods stems from the need to reduce the complexity of the problem at hand.
The main goal of these methods is to find counterparts of the input data in a much lower dimensional space without incurring significant information loss. The low dimensional representation can then be used in subsequent procedures such as classification and pattern recognition. For example, a medium-sized protein structure typically has a few thousand degrees of freedom and therefore naturally resides in a very high dimensional space. This causes the so-called "curse of dimensionality" problem, which not only drastically increases the computational complexity of learning algorithms but also requires large storage space, leading to very slow indexing and searching in a large-scale database. Through DR, most of the redundant dimensions can be removed, and the remaining degrees of freedom are then passed to the subsequent discriminant and classification tasks, which are considerably simplified.

Another advantage of DR is that 2- or 3-dimensional mappings of the original data can be visualized in a Euclidean space, which helps researchers interpret the relationships among the data. The relationships among a set of protein structures are typically represented as trees derived by hierarchical clustering. However, this representation only provides some hints about the evolutionary distances between protein structures. This limitation motivates our application of DR methods to visualizing the similarity relationships among protein structures.

In this paper, we propose a new DR algorithm called Constrained Twin Kernel Embedding (CTKE), based on Twin Kernel Embedding (TKE) [10], to achieve the above goal. To provide the necessary background, we will first give a brief review of DR methods on protein structures in the next section, followed by an introduction to TKE and the related topics involved in this new algorithm.
We then introduce the CTKE algorithm, which integrates the similarity matching of TKE with the objective function used in LS-SVM [23]. Experiments on real protein structure data are presented next, and the paper closes with a conclusion.

2. THE RELATED WORKS

DR methods can be categorized into linear DR (LDR) methods, such as Principal Component Analysis (PCA) [13] and Linear Discriminant Analysis (LDA) [7], and nonlinear DR (NLDR) methods, such as ISOMAP [25], Laplacian Eigenmaps (LE) [3], and Locally Linear Embedding (LLE) [20]. LDR methods have been widely used in bioinformatics due to their simplicity. For example, Teodoro et al. [27] applied PCA to transform the original high dimensional protein motion data into a lower dimensional representation that captures the dominant modes of motion of the protein. However, the linearity assumption on which linear methods are built does not hold in most cases. In [5], Das et al.
SciRes Copyright © 2008