Robust Genetic Network Modeling by Adding Noisy Data

E.P. van Someren, L.F.A. Wessels, M.J.T. Reinders, E. Backer
Information and Communication Theory Group, Control Laboratory
Faculty of Information Technology and Systems, Delft University of Technology,
P.O. Box 5031, 2600 GA Delft, The Netherlands
E.P.vanSomeren@its.tudelft.nl

Keywords: Genetic Networks, Robust Modeling, Tikhonov Regularization, Ridge Regression

Abstract

The most fundamental problem in genetic network modeling is generally known as the dimensionality problem. Typical gene expression matrices contain measurements of thousands of genes taken over fewer than twenty time-steps. A large dynamic network cannot be learned from data with such a limited number of time-steps without the use of additional constraints, preferably derived from biological knowledge. In this paper, we present an approach that can find rough estimates of the underlying genetic network from limited time-course gene expression data by exploiting two facts: gene expression measurements are relatively noisy, and genetic networks are thought to be robust. The method expands the data-set by adding noisy duplicates, thereby simultaneously tackling the dimensionality problem and making the solutions more robust against the (already large) noise in the data. This simple concept is similar to adding a Tikhonov regularization term in the optimization process. In the case of linear models, the addition of noisy duplicates is equivalent to ridge regression, i.e. the sum of the squared weights is minimized along with the prediction error. In the limiting case, it even becomes equivalent to the application of the Moore-Penrose Pseudo-Inverse to the original data. The strength of the proposed concept of adding noisy duplicates lies in the fact that it can be applied to all modeling approaches, including non-linear models.
1 Introduction

(This work has also been submitted to ISMB'01.)

Current micro-array technology has caused a significant increase in the number of genes whose expression can be measured simultaneously on a single array. However, the number of measurements taken in a time-course experiment has not increased in a similar fashion. As a result, typical gene expression data-sets consist of relatively few time-points (generally fewer than 20) with respect to the number of genes (thousands). This so-called dimensionality problem and the fact that measurements contain a substantial amount of measurement noise are two of the most fundamental problems in genetic network modeling [1].

Genetic network modeling is the field of research that tries to infer the underlying network of gene-gene interactions from the measured set of gene expressions. Until now, several different modeling approaches have been suggested, such as Boolean networks [2], Bayesian networks [3, 4], (quasi-)linear networks [5, 6], neural networks [7, 8] and differential equations [9]. In these approaches, genetic interactions are represented by parameters of a parametric model, which need to be inferred from the gene expressions measured over time. Generally, when the parameters of genetic network models are learned from ill-conditioned data (many genes, few time samples), the solutions become arbitrary. Comparative studies [1, 10] have recently reported that at least a number of the currently proposed models suffer from poor inferential power. This is partly caused by their limited use of imposed criteria that improve the robustness, consistency and stability of the solutions.

In this paper, we employ the concept of artificially expanding the original measured data-set with a set of noisy duplicates. We show that training models on this expanded set not only makes the models more robust against noise but also tackles the dimensionality problem.
In [11] it has been shown that adding noise to the training set is equivalent to Tikhonov regularization, provided the noise amplitude is kept small. In addition, it is generally known that the class of Tikhonov regularizers is especially suited for ill-conditioned data [12]. Moreover, the concept of adding noise can also be applied to non-linear models.

First, we present the fundamental problems of genetic network modeling and review currently known methodologies to overcome them. Then, we introduce our principal motivation, which leads to the basic idea of expanding the training set, and show its potential success. After the introduction of the basic idea, the method of adding noise is explained and illustrated by applying it to a linear genetic network model. As part of this example, we briefly cover the equivalence between adding noise and regularization, as well as its relation to ridge regression and the Moore-Penrose Pseudo-Inverse in the case of a linear model. A detailed experimental study shows how to set the parameters involved in the proposed method. Furthermore, these studies show under which conditions the inferred networks are closer to the true ones. This improved performance is also demonstrated relative to other models currently proposed in the literature.
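For the linear case, the claimed equivalence between training on noisy duplicates and ridge regression can be checked numerically. The sketch below is illustrative only: the problem sizes, noise level and data are hypothetical choices, not taken from the paper. It fits ordinary least squares on a training set expanded with K noisy duplicates per sample and compares the result with the closed-form ridge solution, whose regularization strength follows from the expected value of the noise cross-product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (sizes and noise level chosen for illustration):
# n samples, p inputs, a linear target with a little output noise.
n, p = 10, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Expand the training set: K noisy duplicates of every sample, with
# zero-mean Gaussian noise of amplitude sigma added to the inputs.
K, sigma = 2000, 0.3
X_big = np.vstack([X + sigma * rng.normal(size=(n, p)) for _ in range(K)])
y_big = np.tile(y, K)

# Ordinary least squares on the expanded set ...
w_noisy = np.linalg.lstsq(X_big, y_big, rcond=None)[0]

# ... approximates ridge regression on the original set,
#   w = (X^T X + lam * I)^{-1} X^T y   with   lam = n * sigma^2,
# because for the stacked noise matrices E_k we have
# E[E_k^T E_k] = n * sigma^2 * I while the cross-terms average out.
lam = n * sigma**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The two weight vectors agree up to Monte Carlo error, which shrinks as K grows.
print(np.max(np.abs(w_noisy - w_ridge)))
```

In the limit of vanishing noise amplitude (sigma -> 0), lam -> 0 and the ridge solution reduces to the minimum-norm least-squares solution given by the Moore-Penrose Pseudo-Inverse, consistent with the limiting case mentioned above.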