IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 13, NO. 2, JUNE 2014 97

Hybrid Method Inference for the Construction of Cooperative Regulatory Network in Human

I. Chebil*, R. Nicolle, G. Santini, C. Rouveirol, and M. Elati

Abstract: Reconstruction of large-scale gene regulatory networks (GRNs) is an important step toward understanding the complex regulatory mechanisms within the cell. Many modeling approaches have been introduced to find causal relationships between genes using expression data. However, they suffer from high dimensionality (a large number of genes but a small number of samples), overfitting, heavy computation time, and low interpretability. We have previously proposed an original data mining algorithm, LICORN, that infers cooperative regulation networks from expression datasets. In this work, we present an extension of LICORN to a hybrid inference method, H-LICORN, that searches in both discrete and real-valued spaces. LICORN's algorithm, which uses the discrete space to find cooperative regulation relationships fitting the target gene expression, has been shown to be powerful in identifying cooperative regulation relationships that are out of the scope of most GRN inference methods. Still, like many related GRN inference techniques, LICORN suffers from a large number of false positives. We propose here an extension of LICORN with a numerical selection step, expressed as a linear regression problem, that effectively complements the discrete search of LICORN. We evaluate a bootstrapped version of H-LICORN on the in silico DREAM5 dataset and show that H-LICORN has significantly higher performance than LICORN and is competitive with or outperforms state-of-the-art GRN inference algorithms, especially when operating on small datasets. We also applied H-LICORN to a real dataset of human bladder cancer and show that it performs better than other methods in finding candidate regulatory interactions.
In particular, based solely on gene expression data, H-LICORN is able to identify experimentally validated cooperative relationships between regulators involved in cancer.

Index Terms: Cancer, ensemble methods, gene regulatory network (GRN), linear regression.

Manuscript received April 04, 2014; accepted April 06, 2014. Date of publication April 22, 2014; date of current version May 29, 2014. Asterisk indicates corresponding author.
*I. Chebil is with the University Paris 13, Sorbonne Paris Cité, LIPN, CNRS, UMR 7030, F-93430, Villetaneuse, France (e-mail: i.chebil@lipn.univ-paris13.fr).
G. Santini and C. Rouveirol are with the University Paris 13, Sorbonne Paris Cité, LIPN, CNRS, UMR 7030, F-93430, Villetaneuse, France (e-mail: g.santini@lipn.univ-paris13.fr; c.rouveirol@lipn.univ-paris13.fr).
R. Nicolle and M. Elati are with the iSSB, University of Evry-Val-d'Essonne, CNRS, FRE3561, 91030 Evry Cedex, France (e-mail: r.nicolle@issb.genopole.fr; m.elati@issb.genopole.fr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNB.2014.2316920

I. INTRODUCTION

Research in biological network inference has received growing interest over the last ten years from both computer scientists and statisticians. This is mainly due to the availability of experimental data, such as mRNA concentration measurements (transcriptomic data), which opens the door to the automated identification of biological networks. The goal is to identify, for each gene expressed in a particular cellular context, the regulators (e.g., transcription factors) affecting its transcription, and the coordination of several regulators in specific types of regulation. This problem is highly complex, especially when dealing with human regulation, as the number of candidate networks is exponential in the number of genes that are potentially involved in the network [1].
Network inference methods are quite diverse (see [2], [3] for a review). The main families include both unsupervised [4], [5] and supervised [6], [7] machine learning methods, as well as probabilistic approaches such as graphical Gaussian models [8], [9] and static [10], [11] or dynamic [12] Bayesian networks. In the following, we describe two classical frameworks which have been shown to be competitive and flexible.

The first one is the Boolean network approach, in which data are discretized and interactions are modeled by logical functions. It aims at highlighting interactions by pointing out co-expression and cross-referencing this information with Gene Ontology annotation or prior knowledge about transcription factors. In our previous work, we introduced the LICORN algorithm [5]. It efficiently searches the discretized gene expression matrix for sets of co-activators and co-repressors by using a frequent itemset search technique [13] and locally selects combinations of co-repressors and co-activators as candidate subnetworks. Its application to yeast data showed its ability to detect cooperative transcriptional regulation patterns not identified by other techniques [5] and validated the biological accuracy of the co-regulation sets found through functional enrichment based on Gene Ontology [14]. Nicolle et al. [15] also showed that the learned structure can be used to decipher regulatory variations and increase the stability of predictive models in transcriptomic data.

The second framework is Gaussian Graphical Modelling (GGM) [16], in which a multidimensional Gaussian variable is characterized by its concentration matrix, and conditional independence between a pair of variables corresponds to a zero entry in that matrix [17]. Several inference strategies have been proposed to recover the most significant edges, i.e., the nonzero entries in the concentration matrix: simple thresholding or ranking techniques and, in particular, penalized approaches based on sparsity-inducing penalties [18].
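To illustrate the property on which GGMs rely, the following minimal Python sketch uses a hypothetical three-gene chain g0 - g1 - g2. The concentration matrix `K` is chosen by hand and the helper `partial_correlations` is introduced here for illustration only; neither comes from the data or methods of this study. The sketch shows that a zero entry in the concentration matrix marks a missing edge even when the corresponding covariance is nonzero.

```python
import numpy as np


def partial_correlations(precision):
    """Partial correlations rho_ij = -K_ij / sqrt(K_ii * K_jj).

    `precision` is the concentration (inverse covariance) matrix K.
    """
    d = np.sqrt(np.diag(precision))
    rho = -precision / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho


# Hypothetical concentration matrix for a chain g0 - g1 - g2 (assumed values).
# The zero at positions (0, 2) and (2, 0) encodes that g0 and g2 are
# conditionally independent given g1.
K = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])

# The implied covariance matrix is dense: g0 and g2 are marginally correlated.
Sigma = np.linalg.inv(K)

# Recover the network from the covariance by inverting it and keeping
# the nonzero off-diagonal entries (simple thresholding).
K_recovered = np.linalg.inv(Sigma)
edges = [(i, j)
         for i in range(3)
         for j in range(i + 1, 3)
         if abs(K_recovered[i, j]) > 1e-8]

rho = partial_correlations(K)

print("covariance(g0, g2):", Sigma[0, 2])
print("partial correlation(g0, g1):", rho[0, 1])
print("edges:", edges)
```

In this toy case the covariance between g0 and g2 is nonzero, so a correlation-based method would link them, whereas thresholding the concentration matrix keeps only the two direct edges. With real expression data the covariance must be estimated from few samples, which is why the penalized approaches cited above are used instead of a plain matrix inverse.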
Recently, a new family of algorithms has emerged in the field of GRN inference that relies on supervised classification (regression) and boosting algorithms. The idea of boosting [19], and more generally of learning techniques based on randomized subsets of the original data, is to construct a strong learner from a combination of multiple weak learners. Two such approaches model the inference of the GRN as a feature selection problem for each target gene. ADANET [20] scores each potential regulator according to its discriminative power on an ensemble