Adaptive least absolute regression network analysis improves genetic network reconstruction by employing prior knowledge

James Yong-A-Poi 1,*, Eugene van Someren 2, Domenico Bellomo 1 and Marcel Reinders 1

1 Information and Communication Theory group, Faculty of EEMCS, Delft University of Technology, 2600 GA, Delft, The Netherlands
2 Molecular Design and Informatics, N.V. Organon, part of Schering-Plough Corporation, 5340 BH, Oss, The Netherlands

ABSTRACT

Motivation: The inference of genetic regulatory networks from time-series gene expression data has been performed with linear models. The challenge of this inference problem is solving an underdetermined system in which the number of genes is far greater than the number of measurements. LARNA (least absolute regression network analysis) tackles this problem by employing the LASSO (least absolute shrinkage and selection operator) technique for simultaneous estimation and variable selection. However, different data sources (e.g. literature networks, transcription factor binding information) are now available, each capturing a different part of the true network. Integrating this type of prior knowledge with expression data in LARNA can potentially improve the variable selection and, ultimately, the reconstructed network.

Results: We propose to integrate prior knowledge into LARNA by modifying the LASSO penalty. The performance of our scheme is evaluated on synthetic and real datasets. The evaluation focused on the part of the network for which no prior knowledge was available. Results indicate that integrating prior knowledge improved the reconstruction of the part of the network for which no prior knowledge was available.

1 INTRODUCTION

The behavior of a cell largely depends on the complex interactions between genes, proteins and metabolites. The activity (a.k.a. expression) of a gene is regulated by special proteins, called transcription factors (TFs), that bind to the promoter of the gene.
Activated genes produce specific mRNA molecules that are in turn translated into proteins. Proteins perform all kinds of functions, e.g. they can act as TFs for other genes, as enzymes catalyzing metabolic reactions or as structural elements in the cell. An example of a network of molecular interactions is depicted in Figure 1.

The complete cellular system can be simplified by considering only gene, protein or metabolic interactions. Gene networks focus on gene interactions and are phenomenological models of how the expression level of each gene is influenced by the expression levels of all other genes (Brazhnik, 2002). Here, connections between genes are characterized as either direct or indirect (see caption to Figure 1 for details).

Gene networks are important, because they might contain valuable information that the pharmaceutical and biotechnology industries can use to design novel drugs for complex diseases. They are able to describe concisely how cells react to external influences by means of gene connections, which implicitly capture regulatory mechanisms in the protein and metabolite space (Gardner, 2005).

* To whom correspondence should be addressed.

Linear models have been proposed to infer genetic regulatory networks from time-series gene expression data in many papers (e.g. Chen, 1999; D’Haeseleer, 1999; van Someren, 2006; Cosentino, 2007). Linear models assume that the future expression level of a gene is a linear combination of the past expression levels of all genes. This assumption is only valid near equilibrium, because the Hartman-Grobman theorem states that the behavior of a (biological) nonlinear system around a steady state is similar to that of its linearised system (Grobman, 1959; Hartman, 1960).

Although linear models are already a strong simplification of the underlying biological system, they are still underdetermined, as the number of genes is far greater than the number of measured time points.
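To make the underdetermination concrete, the sketch below uses hypothetical toy numbers (three genes, two observed transitions; none of these values come from the datasets in this paper). It assumes the linear model in its simplest form, where the next expression level of a target gene is a weighted sum of the current levels of all genes, and shows that two different weight vectors reproduce the same measurements exactly. The names `past`, `future_target`, `predict` and `l1_norm` are illustrative only.

```python
# Hypothetical toy data: 3 genes, 2 observed transitions (t -> t+1).
# Each row of `past` holds the expression levels of all genes at time t;
# `future_target` holds the target gene's level at time t+1.
past = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
future_target = [1.0, 1.0]


def predict(weights, past_levels):
    """Linear model: next level = weighted sum of current levels."""
    return [sum(w * x for w, x in zip(weights, row)) for row in past_levels]


def l1_norm(weights):
    """Sum of absolute weights, the quantity a LASSO-type penalty shrinks."""
    return sum(abs(w) for w in weights)


# Two different candidate regulator sets for the same target gene.
w_a = [1.0, 1.0, 0.0]  # genes 1 and 2 regulate the target
w_b = [0.0, 0.0, 1.0]  # gene 3 alone regulates the target

for name, w in (("w_a", w_a), ("w_b", w_b)):
    fits = predict(w, past) == future_target
    print(name, "fits exactly:", fits, "| L1 norm:", l1_norm(w))
```

With fewer transitions than genes, the data alone cannot choose between the two explanations. A criterion that favours a small sum of absolute weights prefers the second, sparser one.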
The LASSO method (Tibshirani, 1996) is an elegant shrinkage method, which performs feature selection and parameter estimation simultaneously. It has previously been applied to overcome the dimensionality problem during the inference of a 100-gene osteoblast differentiation network from eleven time points (van Someren, 2006). The approach was called LARNA.

In this paper, we propose to integrate prior knowledge into the linear regression model LARNA (van Someren, 2006) by adjusting its shrinkage method such that it can utilize partial information about the topology of the true network. A drawback of traditional LARNA is that, when the right regulator must be inferred from a group of genes with highly correlated expression profiles, it may arbitrarily select one of these genes. Prior knowledge about how likely each of these genes is to regulate the current target should help LARNA select the correct gene.

Prior knowledge can be defined as any additional information about connections between genes. Transcription factor binding information identifies the binding of transcription factors to promoters of genes. Literature networks are built by searching articles for co-occurrences of pairs of genes. Many resources can serve as prior knowledge and it is important to integrate them.

The main goal of this paper is to evaluate whether our approach of adding prior knowledge to LARNA has a positive effect on recovering the original network. We evaluate the effect on the parts of the networks for which no prior knowledge was given. Moreover, we evaluate whether our approach is able to infer the original network when the prior knowledge contains errors.

Evaluation of our new approach, called adaptive LARNA (aLARNA), on real data is difficult, because the underlying true gene network is unknown. Therefore, we first estimate its performance on synthetic datasets.
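The idea of steering the shrinkage with prior knowledge can be sketched on a tiny synthetic example. The code below is not the aLARNA implementation: it assumes one possible way of modifying the LASSO penalty, namely a per-gene penalty weight (smaller weight for a regulator supported by prior knowledge), solved with plain coordinate descent. The data, the weights, `lam`, and the function names `soft_threshold` and `weighted_lasso` are all hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_sweeps=100):
    """Coordinate descent for
    0.5 * sum_i (y[i] - sum_j X[i][j] * b[j])**2 + lam * sum_j weights[j] * |b[j]|.
    """
    n_obs, n_genes = len(X), len(X[0])
    b = [0.0] * n_genes
    for _ in range(n_sweeps):
        for j in range(n_genes):
            rho = 0.0
            z = 0.0
            for i in range(n_obs):
                # Residual of observation i with gene j's contribution left out.
                partial = y[i] - sum(
                    X[i][k] * b[k] for k in range(n_genes) if k != j
                )
                rho += X[i][j] * partial
                z += X[i][j] ** 2
            b[j] = soft_threshold(rho, lam * weights[j]) / z
    return b


# Synthetic toy data: 4 observations (rows), 5 candidate regulators (columns
# A..E), so there are fewer observations than genes. Genes A and B have
# identical profiles; the target simply follows that shared profile.
X = [
    [0, 0, 1, 1, 1],
    [2, 2, -1, 0, 1],
    [2, 2, 1, 0, -1],
    [0, 0, -1, -1, -1],
]
y = [0, 2, 2, 0]

uniform = [1.0, 1.0, 1.0, 1.0, 1.0]    # no prior knowledge
with_prior = [1.0, 0.5, 1.0, 1.0, 1.0]  # assumed prior: B regulates the target

b_plain = weighted_lasso(X, y, lam=1.0, weights=uniform)
b_prior = weighted_lasso(X, y, lam=1.0, weights=with_prior)

genes = "ABCDE"
print("plain   :", [g for g, v in zip(genes, b_plain) if v != 0.0])
print("weighted:", [g for g, v in zip(genes, b_prior) if v != 0.0])
```

With uniform weights the solver keeps gene A only because it is visited first, which mirrors the arbitrary choice among correlated profiles; lowering the penalty on gene B moves the selection to B. Because the toy data are synthetic, the true regulator is known and the selection can be checked against it.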
In this way, we can compare the inferred networks with the original network and include (part of) the