1545-5963 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2016.2621042, IEEE/ACM Transactions on Computational Biology and Bioinformatics

Application of Genetic Programming (GP) formalism for building disease predictive models from protein-protein interactions (PPI) data

Renu Vyas, Sanket Bapat, Purva Goel, M. Karthikeyan, S.S. Tambe and B.D. Kulkarni

Abstract: Protein-protein interactions (PPIs) play a vital role in the biological processes involved in cell functions and disease pathways. The experimental methods used to predict PPIs require tremendous effort, and their results are often hindered by a large number of false positives. Herein, we demonstrate the use of a new Genetic Programming (GP) based Symbolic Regression (SR) approach for predicting PPIs related to a disease. In a case study, a dataset consisting of one hundred and thirty-five PPI complexes related to cancer was used to construct a generic PPI-predicting model with good prediction accuracy and generalization ability. A high correlation coefficient (CC) of 0.893, and low root mean square error (RMSE) and mean absolute percentage error (MAPE) values of 478.221 and 0.239, respectively, were achieved for both the training and test set outputs. To validate the discriminatory nature of the model, it was applied to a dataset of diabetes complexes, where it yielded significantly low CC values. Thus, the GP model developed here serves a dual purpose: (a) a predictor of the binding energy of cancer-related PPI complexes, and (b) a classifier for discriminating PPI complexes related to cancer from those of other diseases.
Index Terms: Genetic Programming, Protein-protein interactions, Disease, Binding energy, Machine learning, Cancer, Symbolic Regression

R. Vyas is with the MIT School of Bioengineering Science and Research, Loni, Kalbhor, Pune. E-mail: renu.vyas@mituniversity.edu.in
S. Bapat and M. Karthikeyan are with the DIRC, CSIR-National Chemical Laboratory, Pune 411008, India. E-mail: sanket.bapat@yahoo.in, m.karthikeyan@ncl.res.in
P. Goel, S. Tambe and B.D. Kulkarni are with the Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411008, India. E-mail: goelpurva@gmail.com, ss.tambe@ncl.res.in, bdkulkarni@ncl.res.in

1 INTRODUCTION

Protein-protein interactions (PPIs) regulate the functions and cellular activity within a cell [1]. In a state of homeostasis, the role of PPIs is to carry out the various biological and functional activities of the cell; any disturbance in the cell disrupts the normal functioning of PPIs. DNA replication, transcription, translation, splicing, cell cycle control and signal transduction are some of the biological processes in which proteins interact with each other. In a diseased state, the normal pathways are up-regulated or down-regulated based on the association or dissociation between a pair of interacting proteins. Predicting a protein's potential to interact with its partner protein is therefore essential for modeling disease progression, prognosis and prediction, and for understanding pathways [2].

In the present work, we propose to develop a predictive model for identifying the protein-protein complexes related to a disease. The interaction between two interacting polypeptide chains was studied by employing various parameters, such as the binding energy and other 3D structural information obtained from the X-ray solved co-crystals, as model inputs (predictor variables). A mathematical expression that correlates these predictor variables with the propensity of polypeptide chains to bind with each other was developed using the evolutionary GP-based symbolic regression (SR) method.

1.1 Overview of machine learning studies on protein-protein interactions

Experimentally, the information regarding protein-