Predicting Protein-Protein Interaction Based on Fisher Scores Extracted from Domain Profiles

Tapan Patel and Li Liao*
Department of Computer and Information Sciences
University of Delaware
Newark, Delaware 19716, USA
lliao@cis.udel.edu

Abstract-In this work, we propose a machine learning method that identifies protein-protein interacting partners from domain-level knowledge and can take into account information about the interaction sites. The general approach is to use the profile hidden Markov models of protein domains, together with the known interactions between domains, to train a support vector machine (SVM). Each protein is characterized by a vector of Fisher scores obtained by comparing the protein sequence to the hidden Markov model for a given domain. Protein pairs, represented by the concatenation of their respective Fisher score vectors, are classified as interacting or non-interacting partners by the trained SVM. By selecting the Fisher scores based on a profile hidden Markov model that differentiates the interaction sites from other residues within the domain, we demonstrated that the prediction accuracy was significantly improved, as measured in a series of cross-validation experiments.

Keywords-protein-protein interaction; Fisher scores; feature selection; profile hidden Markov models; support vector machines

I. INTRODUCTION

Predicting protein-protein interaction has become a central task in systems biology for reverse engineering biological networks. Because the current high-throughput experimental approaches to identifying protein-protein interaction (PPI), such as the yeast two-hybrid assay [1,2], are still very costly and unreliable, much attention has recently been given to the development of computational methods. These computational methods predict protein-protein interaction based on information at different levels, from primary sequences, to molecular structures, to evolutionary profiles [3-13].
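As a rough illustration of the pair representation described in the abstract, the following Python sketch computes a Fisher score vector for each of two sequences and concatenates them into a single feature vector. It is not the authors' implementation. It assumes an ungapped, position-specific emission model with softmax-parameterized probabilities in place of a full profile HMM, a toy four-letter alphabet, and hypothetical names (`fisher_score_vector`, `pair_features`, `toy_model`). Training an SVM on the resulting vectors is omitted to keep the sketch dependency-free.

```python
ALPHABET = "ACDE"  # toy alphabet (assumption); real proteins use 20 amino acids


def fisher_score_vector(seq, emissions):
    """Gradient of log P(seq | model) for an ungapped, position-specific model.

    emissions[k][a] is the probability of residue a at position k. Assuming
    each position's distribution is a softmax over per-residue logits, the
    derivative of the log-likelihood with respect to the logit of residue a
    at position k is 1[seq[k] == a] - emissions[k][a].
    """
    if len(seq) != len(emissions):
        raise ValueError("sequence length must equal model length in this toy model")
    scores = []
    for residue, dist in zip(seq, emissions):
        for a in ALPHABET:
            indicator = 1.0 if residue == a else 0.0
            scores.append(indicator - dist[a])
    return scores


def pair_features(seq_a, model_a, seq_b, model_b):
    """Represent a protein pair as the concatenation of both Fisher score vectors."""
    return fisher_score_vector(seq_a, model_a) + fisher_score_vector(seq_b, model_b)


# Hypothetical two-position domain model, used here for both proteins.
toy_model = [
    {"A": 0.7, "C": 0.1, "D": 0.1, "E": 0.1},
    {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25},
]

features = pair_features("AC", toy_model, "DE", toy_model)
print(len(features))
```

In a full profile HMM, the indicator term would typically be replaced by expected emission counts obtained from the forward-backward algorithm, and the concatenated vectors for known interacting and non-interacting pairs would serve as training examples for the SVM.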
Because different methods often require different data as input, it is difficult to compare them fairly; each has its pros and cons. Typically, more sensitive prediction tends to require more extensive information, e.g., phylogenetic information, and more specific prediction tends to require more detailed information, e.g., structural information. It is, therefore, important to be able to transfer knowledge across different sources and at different levels while maintaining a balance between sensitivity and specificity, or better still, enhancing both.

Proteins interact with one another via interacting domains, and identifying these domains has become a common approach to predicting protein-protein interaction. Although the domains responsible for binding two proteins together tend to possess certain biochemical properties that dictate a specific composition of amino acids, such compositions are typically not unique enough to be relied upon alone for domain identification: variations are common in the multiple sequence alignment of proteins that contain the same domain. Hidden Markov models are among the most successful efforts to capture the commonalities of a given domain while allowing for variations. A collection of hidden Markov models covering many common protein domains and families is available in the PFAM database [14].

However, a few factors can compromise efforts to use domain-domain interaction (DDI) to predict protein-protein interaction. Although the DDI-PPI correspondence is corroborated by other evidence, such as the domain modularity of proteins and DDIs shared among PPIs, in most cases experimental verification in support of it is still missing [15]. As mentioned above, membership of domain families is established at best via probabilistic modeling, so false positives are not uncommon.
While the interaction sites within domains, as recently demonstrated, play a key role in determining protein-protein interaction [16], such information is not readily available for many proteins, because the dataset of crystallographically solved PPIs remains relatively small.

In a recent work, Friedrich et al. [16] developed an improved method that models the interaction sites within protein domains with interaction profile hidden Markov models. The topology of this newly defined interaction profile hidden Markov model (ipHMM) takes both structural and sequence data into account. However, the structural information is needed only for training the model. Once a model is trained, it can be used to predict interaction sites for proteins with only the sequence information as input. A posterior decoding algorithm yields probabilities for interacting sequence positions and enhances the quality of interaction site predictions.

*Corresponding author: lliao@cis.udel.edu. 1-4244-1509-8/07/$25.00 ©2007 IEEE

In this work, we propose a method that utilizes the ipHMMs to predict interacting partners for proteins for which only sequence information is available. Because the model is built from structural information at the domain level and is capable of bridging potential interacting partners with matched structural signatures, the method improves the prediction accuracy while still requiring only the primary sequences as input. Moreover, by selectively extracting the sufficient