Journal of Bioinformatics and Computational Biology
Vol. 5, No. 3 (2007) 717–738
© Imperial College Press

INCORPORATING HOMOLOGUES INTO SEQUENCE EMBEDDINGS FOR PROTEIN ANALYSIS

ELEAZAR ESKIN∗
Department of Computer Science, Department of Human Genetics,
University of California, Los Angeles, Los Angeles, CA, 90095
eeskin@cs.ucla.edu

SAGI SNIR
School of Computer Science and Mathematics,
Netanya Academic College, Netanya, Israel, 42100
ssagi@netanya.ac.il

∗ Corresponding author.

Received 16 December 2005
Revised 23 October 2006
Accepted 23 February 2007

Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.

Keywords: Protein classification; sequence alignment; kernel methods.

1. Introduction

The analysis of protein sequences is one of the most successful areas in bioinformatics. Three major ideas and intuitions are used over and over again in many