IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 13, NO. 2, JUNE 2014
Hybrid Method Inference for the Construction of
Cooperative Regulatory Network in Human
I. Chebil*, R. Nicolle, G. Santini, C. Rouveirol, and M. Elati
Abstract—Reconstruction of large scale gene regulatory
networks (GRNs in the following) is an important step for under-
standing the complex regulatory mechanisms within the cell. Many
modeling approaches have been introduced to find the causal
relationships between genes using expression data. However, they
suffer from high dimensionality (a large number of genes but a
small number of samples), overfitting, heavy computation time,
and low interpretability. We previously proposed an original
data mining algorithm, LICORN, that infers cooperative
regulation networks from expression datasets. In this work, we
present an extension of LICORN to a hybrid inference method,
H-LICORN, that searches both discrete and real-valued spaces.
LICORN’s algorithm, using the discrete space to find cooperative
regulation relationships fitting the target gene expression, has
been shown to be powerful in identifying cooperative regulation
relationships that are out of the scope of most GRN inference
methods. Still, like many related GRN inference techniques,
LICORN suffers from a large number of false positives. We propose
here an extension of LICORN with a numerical selection step,
expressed as a linear regression problem, that effectively comple-
ments the discrete search of LICORN. We evaluate a bootstrapped
version of H-LICORN on the in silico DREAM5 dataset and show
that H-LICORN performs significantly better than LICORN and is
competitive with or outperforms state-of-the-art GRN inference
algorithms, especially when operating on small data sets. We
also applied H-LICORN to a real dataset of human bladder cancer
and show that it performs better than other methods in finding
candidate regulatory interactions. In particular, solely based on
gene expression data, H-LICORN is able to identify experimentally
validated regulator cooperative relationships involved in cancer.
Index Terms—Cancer, ensemble methods, gene regulatory net-
work (GRN), linear regression.
I. INTRODUCTION
Research in biological network inference has received
growing interest over the last ten years from both
computer scientists and statisticians. This is mainly due to the
availability of experimental data such as mRNA concentration
measurements (transcriptomic data), which opens the door to the
automated identification of biological networks. The goal is to
identify, for each gene expressed in a particular cellular context,
the regulators (e.g., transcription factors) affecting its
transcription, and the coordination of several regulators in
specific types of regulation. This problem is highly complex,
especially when dealing with human regulation, as the number of
candidate networks is exponential in the number of genes that are
potentially involved in the network [1].
Manuscript received April 04, 2014; accepted April 06, 2014. Date of
publication April 22, 2014; date of current version May 29, 2014.
Asterisk indicates corresponding author.
*I. Chebil is with the University Paris 13, Sorbonne Paris Cité, LIPN, CNRS,
UMR 7030, F-93430, Villetaneuse, France (e-mail: i.chebil@lipn.univ-paris13.fr).
G. Santini and C. Rouveirol are with the University Paris 13, Sorbonne Paris
Cité, LIPN, CNRS, UMR 7030, F-93430, Villetaneuse, France (e-mail:
g.santini@lipn.univ-paris13.fr; c.rouveirol@lipn.univ-paris13.fr).
R. Nicolle and M. Elati are with the iSSB, University of Evry-Val-d’Essonne,
CNRS, FRE3561, 91030 Evry Cedex, France (e-mail: r.nicolle@issb.genopole.fr,
m.elati@issb.genopole.fr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNB.2014.2316920
Network inference methods are quite diverse (see [2], [3] for
a review). The main families include both unsupervised [4], [5]
and supervised [6], [7] machine learning methods, as well as
probabilistic approaches such as graphical Gaussian models [8],
[9] and static [10], [11] or dynamic [12] Bayesian networks. In
the following, we describe two classical frameworks which have
been shown to be competitive and flexible.
The first one is the Boolean network approach, in which data
are discretized, and interactions modeled by logical functions. It
aims at highlighting interactions by pointing out co-expression
and crossing this kind of information with Gene Ontology
annotation or prior knowledge about transcription factors. In
previous work, we introduced the LICORN algorithm [5].
It efficiently searches the discretized gene expression matrix for
sets of co-activators and co-repressors by using a frequent
itemset search technique [13] and locally selects combinations of
co-repressors and co-activators as candidate subnetworks. Its
application to yeast data showed its ability to detect cooperative
transcriptional regulation patterns not identified by other tech-
niques [5] and validated the biological accuracy of the found
co-regulation sets through functional enrichment based on Gene
Ontology [14]. Nicolle et al. [15] also showed that the learned
structure can be used to decipher regulatory variations and in-
crease the stability of predictive models in transcriptomic data.
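The discretize-then-search idea described above can be illustrated with a small sketch. It is not the LICORN implementation: the function names `discretize` and `frequent_coexpressed_sets`, the threshold, the support parameters, and the toy expression values are all assumptions made for illustration. The sketch maps expression values to the states -1, 0, and +1 and then runs a level-wise (Apriori-style) search for sets of regulators that are over-expressed together in enough samples. The selection of co-activator and co-repressor combinations against a target gene is not reproduced here.

```python
from itertools import combinations


def discretize(values, threshold=1.0):
    """Map real-valued expression to -1 (under), 0 (unchanged), +1 (over)."""
    return [1 if v > threshold else -1 if v < -threshold else 0 for v in values]


def frequent_coexpressed_sets(expr, state=1, min_support=2, max_size=3):
    """Level-wise search for regulator sets that share `state` in at least
    `min_support` samples. Returns {frozenset(regulators): set(sample indices)}."""
    # Samples in which each single regulator is in the requested state.
    singles = {
        frozenset([reg]): {s for s, v in enumerate(vals) if v == state}
        for reg, vals in expr.items()
    }
    frequent = {k: v for k, v in singles.items() if len(v) >= min_support}
    result = dict(frequent)
    size = 1
    while frequent and size < max_size:
        size += 1
        candidates = {}
        # Join pairs of frequent sets from the previous level.
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == size and union not in candidates:
                # Samples where every regulator of the union is in `state`.
                samples = frequent[a] & frequent[b]
                if len(samples) >= min_support:
                    candidates[union] = samples
        frequent = candidates
        result.update(frequent)
    return result


# Hypothetical toy data: three regulators measured in five samples.
raw = {
    "R1": [2.1, 1.8, -0.2, 1.5, -1.7],
    "R2": [1.9, 1.2, 0.1, 1.6, -1.4],
    "R3": [-1.3, 0.4, 1.8, -0.1, 1.1],
}
expr = {reg: discretize(vals) for reg, vals in raw.items()}
sets = frequent_coexpressed_sets(expr, state=1, min_support=2)
for regs, samples in sets.items():
    print(sorted(regs), sorted(samples))
```

In this toy data, R1 and R2 are over-expressed in the same three samples, so the pair is retained as a candidate set, whereas no pair involving R3 reaches the support threshold.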
The second framework is Gaussian Graphical Modelling
(GGM) [16], in which a multidimensional Gaussian variable
is characterized by its concentration matrix, where conditional
independence between pairs of variables induces a zero entry
[17]. Several inference strategies have been proposed to recover
the most significant edges, i.e., the nonzero entries of the
concentration matrix: simple thresholding or ranking techniques
and, in particular, penalized approaches based on
sparsity-inducing penalties [18].
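The link between zero entries of the concentration matrix and conditional independence can be checked on a small worked example. The sketch below is an illustration under stated assumptions, not a method from the cited works: the helper names `invert` and `ggm_edges` are hypothetical, and the covariance matrix is that of an assumed chain X -> Y -> Z with unit-variance noise at each step. Plain matrix inversion is used only because the example is tiny and the covariance is known exactly; with few samples and many genes the sample covariance is singular, which is where the penalized approaches mentioned above come in.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ggm_edges(cov, threshold=1e-6):
    """Return {(i, j): partial correlation} for pairs whose partial
    correlation exceeds `threshold` in absolute value."""
    prec = invert(cov)  # concentration (precision) matrix
    edges = {}
    for i in range(len(cov)):
        for j in range(i + 1, len(cov)):
            pcor = -prec[i][j] / (prec[i][i] * prec[j][j]) ** 0.5
            if abs(pcor) > threshold:
                edges[(i, j)] = pcor
    return edges


# Covariance of the assumed chain X -> Y -> Z (Y = X + noise, Z = Y + noise).
cov = [
    [1, 1, 1],
    [1, 2, 2],
    [1, 2, 3],
]
edges = ggm_edges(cov)
print(edges)
```

Although X and Z are marginally correlated (their covariance is 1), the corresponding entry of the concentration matrix is zero, so only the edges X-Y and Y-Z are recovered.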
Recently, a new family of algorithms has emerged in the field of
GRN inference that relies on supervised classification
(regression) and boosting algorithms. The idea of boosting [19], and
more generally of learning techniques based on randomized
subsets of the original data, is to construct a strong learner from
a combination of multiple weak learners. Two such approaches
model the inference of the GRN as a feature selection problem
for each target gene. ADANET [20] scores each potential reg-
ulator according to its discriminative power on an ensemble