Research Article

Perceptron ensemble of graph-based positive-unlabeled learning for disease gene identification

Gholam-Hossein Jowkar*, Eghbal G. Mansoori
School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran

ARTICLE INFO
Article history:
Received 27 March 2016
Received in revised form 25 June 2016
Accepted 8 July 2016
Available online 12 July 2016

Keywords: Disease gene identification; Biological networks; Positive-unlabeled learning; Ensemble of classifiers; Perceptron

ABSTRACT
Identifying disease genes with computational methods is an important problem in biomedical and bioinformatics research. Based on the observation that diseases with the same or similar phenotype share the same biological characteristics, researchers have tried to identify disease genes using machine learning tools. In recent attempts, semi-supervised methods known as positive-unlabeled learning have been used for disease gene identification. In this paper, we present a Perceptron ensemble of graph-based positive-unlabeled learning (PEGPUL) on three types of biological attributes: gene ontologies, protein domains and protein-protein interaction networks. In our method, a reliable set of positive and negative genes is extracted using a co-training scheme. Then, the similarity graph of genes is built using metric learning, concentrating on the multi-rank-walk method to perform inference from labeled genes. Finally, a Perceptron ensemble is learned from three weighted classifiers: multilevel support vector machine, k-nearest neighbor and decision tree. The main contributions of this paper are: (i) incorporating the statistical properties of gene data by choosing proper metrics, (ii) statistical evaluation of biological features, and (iii) the noise robustness of PEGPUL, achieved through a multilevel scheme.
To assess PEGPUL, we applied it to 12950 genes, comprising 949 positive (disease) genes from six classes of diseases and 12001 unlabeled genes. The experimental results show that PEGPUL performs reasonably well compared with some popular disease gene identification methods.
© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

In biomedical research, identifying the genes underlying human hereditary diseases is essential for prenatal and postnatal diagnosis and treatment (Piro and Cunto, 2012). Huntington's disease, whose gene lies on human chromosome 4, was the first genetic disease to be discovered using polymorphism information (Bromberg, 2013). Since then, biologists have focused on disease-associated genes and gene mutations to identify genetic disorders. By screening for them, a child's vulnerability to inherited diseases can be determined before birth. Such findings also inform the prognosis and counselling of affected families and, in some cases, can lead to the development of therapeutic strategies (Piro and Cunto, 2012).

Since the abnormal function of genes in the body causes some diseases, it is necessary to identify the molecular pathways of these disorders (Bromberg, 2013). In this regard, studies of the properties of disease genes have shown that genes associated with the same or similar diseases lie in the same neighborhood in molecular networks (Piro and Cunto, 2012). Moreover, traditional tools are expensive and time-consuming. These observations led to the development of computational approaches for the prediction or prioritization of candidate disease genes (Wang et al., 2011). These approaches rely on the observation that diseases with the same or similar phenotype share the same biological characteristics.
In this regard, computational analysis is used to combine different data sources, functional information about genes is used to extract disease gene knowledge, and machine learning methods are used to predict disease genes.

In terms of learning type, disease gene identification methods can be categorized into three groups: unsupervised, supervised and semi-supervised learning. Traditionally, researchers have treated the classification of disease genes as a supervised learning problem (Kohler et al., 2008; Smalter et al., 2007; Radivojac et al., 2008), though some studies regard it as a semi-supervised problem (Yang et al., 2012; Yang et al., 2014). Since learning in semi-supervised methods starts with a small set of labeled (positive and negative) samples, the obtained model faces a small subset of positive samples and a huge subset of unlabeled samples

* Corresponding author.
E-mail addresses: hjowkar@shirazu.ac.ir (G.-H. Jowkar), mansoori@shirazu.ac.ir (E.G. Mansoori).
http://dx.doi.org/10.1016/j.compbiolchem.2016.07.004
1476-9271/© 2016 Elsevier Ltd. All rights reserved.
Computational Biology and Chemistry 64 (2016) 263–270
Journal homepage: www.elsevier.com/locate/compbiolchem