Structural Prediction of Protein-Protein Interactions in Saccharomyces cerevisiae

Martin S.R. Paradesi, Doina Caragea, William H. Hsu
Department of Computing and Information Sciences, Kansas State University
Manhattan, KS 66506 USA
{pmsr, dcaragea, bhsu}@cis.ksu.edu

Abstract—Protein-protein interactions (PPI) refer to the associations between proteins and the study of these associations. Several approaches have been used to address the problem of predicting PPI. Some of them are based on biological features extracted from a protein sequence (such as amino acid composition and GO terms); others use relational and structural features extracted from the PPI network, which can be represented as a graph. Our approach falls in the second category. We adapt a general approach to graph feature extraction that has previously been applied to collaborative recommendation of friends in social networks. Several structural features are identified based on the PPI graph and used to learn classifiers for predicting new interactions. Two datasets containing Saccharomyces cerevisiae PPI are used to test the proposed approach. Both datasets were assembled from the Database of Interacting Proteins (DIP). We assembled the first dataset directly from DIP in April 2006, while the second dataset has been used in previous studies, which makes it easy to compare our approach with previous approaches. Several classifiers are trained using the structural features extracted from the interaction graph. The results show good performance (accuracy, sensitivity and specificity), indicating that the structural features are highly predictive with respect to PPI.

Keywords: protein-protein interaction, graph mining, machine learning

I. INTRODUCTION

Protein-protein interactions (PPI) play an important role in the study of biological processes. Many PPI have been discovered over the years and several databases have been created to store the information about these interactions (e.g.
BIND, DIP, MIPS, IntAct and MINT). Mering et al. [8] state that about 80,000 interactions between yeast proteins are currently available from various high-throughput interaction-detection methods. Determining PPI using high-throughput methods is not only expensive and time-consuming, but also generates a high number of false positives and false negatives. Therefore, there is a need for computational approaches that can help in the process of identifying real protein interactions. From a machine learning point of view, this problem can be seen as a binary classification problem, and can be addressed using supervised learning algorithms. In this paper, we use a graph mining approach to predict the existence of a PPI in a network of interacting proteins.

II. PREVIOUS WORK

Several methods have been designed to address the task of predicting protein-protein interactions. Most of them [1], [2], [10], [12] use features extracted from protein sequences (e.g., amino acid composition) or associated with protein sequences directly (e.g., GO annotation). Others use relational and structural features extracted from the PPI network, along with the features related to the protein sequence. When using the PPI network to design features, several node and topological features can be extracted directly from the associated graph.

Qi et al. [8] divide the protein interaction prediction task into three sub-tasks: (1) prediction of physical (or actual) interaction among proteins, (2) prediction of proteins belonging to the same complex and (3) prediction of proteins belonging to the same pathway. They apply several classifiers to the prediction tasks considered. Their results show that RandomForest is one of the top two classifiers for all tasks; the other is the RandomForest similarity-based k-Nearest-Neighbor classifier.
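
As an illustration of the binary-classification framing and of features read directly off the interaction graph, the following minimal Python sketch builds labeled protein pairs from a toy edge list. The protein names, the three features and the function names are hypothetical, and treating every non-interacting pair as a negative example is an assumption made for brevity, not a description of the procedure used in this paper.

```python
from itertools import combinations

# Hypothetical toy interaction list; names are placeholders, not DIP data.
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_graph(edges):
    """Return an undirected adjacency map: protein -> set of partners."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def pair_features(graph, u, v):
    """Structural features for a candidate pair.

    The (u, v) edge itself is ignored so that the label does not
    leak into the features.
    """
    neighbors_u = graph[u] - {v}
    neighbors_v = graph[v] - {u}
    return {
        "degree_u": len(neighbors_u),
        "degree_v": len(neighbors_v),
        "common_neighbors": len(neighbors_u & neighbors_v),
    }


def labeled_pairs(graph):
    """Yield (pair, features, label); label is 1 for a known interaction.

    Assumption: unobserved pairs are used as negatives.
    """
    for u, v in combinations(sorted(graph), 2):
        yield (u, v), pair_features(graph, u, v), int(v in graph[u])


graph = build_graph(interactions)
dataset = list(labeled_pairs(graph))
```

Each entry of `dataset` pairs a feature dictionary with a 0/1 label, which is the input shape that a supervised learner would expect. Any standard classifier could be trained on it.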

Licamele & Getoor [4] combine the link structure of the PPI graph with the information about proteins in order to predict the interactions in a yeast dataset gathered from several databases. More specifically, they look at the shared neighborhood among proteins and calculate the clustering coefficient among the neighborhoods for the first-order and second-order protein relations. They obtained an accuracy of 81% when predicting new links from noisy high-throughput data.

The above-mentioned approaches use relational data of the PPI network along with other biologically relevant information (such as sequence, gene expression data and GO terms) to predict the protein interactions. In contrast, we use only the relational features of the PPI network data in our study.

III. OUR APPROACH

In related work in a completely different application domain, Hsu et al. [3] address the problem of collaboratively recommending friends for a person, based on structural features extracted from a given social network graph. Their approach to the collaborative recommendation of friends uses the link structure of the social network and also information about mutually declared interests. They use structural features