A Data Integration Method for Exploring Gene Regulatory Mechanisms

Jane Synnergren
Systems Biology Research Centre
University of Skövde
SE-541 28 Skövde
+46 (0)500-448311
jane.synnergren@his.se

Björn Olsson
Systems Biology Research Centre
University of Skövde
SE-541 28 Skövde
+46 (0)500-448316
bjorn.olsson@his.se

Jonas Gamalielsson
Systems Biology Research Centre
University of Skövde
SE-541 28 Skövde
+46 (0)500-448375
jonas.gamalielsson@his.se

ABSTRACT
Systems biology aims to understand the behavior of, and the interactions between, the various components of the living cell, such as genes, proteins, and metabolites. These complex systems involve a large number of components, and the diversity of relationships between them can be overwhelming; there is therefore a need for analysis methods that incorporate data integration. We here present a method for exploring gene regulatory mechanisms which integrates various types of data to assist the identification of important components in these mechanisms. First, gene expression data are analyzed to select a set of differentially expressed genes. These genes are then further investigated by combining various types of biological information, such as clustering results, promoter sequences, binding sites, transcription factors, and other previously published information regarding the selected genes. Inspired by Information Fusion research, we also mapped the functions of the proposed method to the well-known OODA model, to facilitate application of this data integration method in other research communities. We have successfully applied the method to genes identified as differentially expressed in human embryonic stem cells at different stages of differentiation towards cardiac cells. We identified 15 novel motifs that may represent important binding sites in the cardiac cell lineage.

Categories and Subject Descriptors
J.3 [Computer Applications]: Life and Medical Sciences – biology and genetics.
General Terms
Algorithms, Design, Reliability, Experimentation, Verification

Keywords
Gene expression, gene regulation, motifs, data integration, data fusion

1. INTRODUCTION
Despite much effort, we are still far from a comprehensive understanding of many important biological systems. The amount of biological data stored in public databases increases rapidly (in many cases exponentially), and the development of efficient methods that facilitate the biological interpretation of these data is crucial. The identification of genomic regulatory elements is an important but unsolved problem in genome annotation. At present we have only limited knowledge of transcription factors (TFs), their binding sites, and the genes which they regulate. Since regulatory elements are frequently short and often include variable positions, their identification and discovery using computational algorithms is challenging. A vast number of computational approaches for identifying regulatory elements have been developed in the past decade, and significant advances have been made. A review by GuhaThakurta [1] surveyed many computational methods for identification of transcriptional regulatory elements, but none of these methods uses a data integration approach. Thus, in the present work we focus on efficient identification of regulatory mechanisms, and propose an approach for analysis and interpretation of gene expression data based on the integration of various types of related biological information.

1.1 Gene regulation mechanisms
The transcription of genes is modulated by the interaction of TFs with their binding sites and by the affinity with which they bind. TFs are thus essential molecules for the regulation of gene expression, since they control the transcription of genetic information from DNA to RNA.
The regulatory function of TFs can be carried out by a single molecule or by a complex of proteins, and the regulatory effect can be to induce or repress the transcription of the regulated gene. Binding sites are short stretches of DNA in the regulatory regions, located upstream of the regulated gene, where TFs bind to the DNA. The nucleotide sequences of binding sites can typically be represented by motifs, i.e. short patterns of nucleotide combinations that identify specific binding sites. To increase our understanding of how gene regulation is controlled it is of critical importance to identify common regulatory motifs, and investigate their roles in various regulatory processes. Several algorithms have been developed for computational prediction of transcriptional regulatory mechanisms from sequence data, gene expression data and interaction data [2]. Recent results have shown the usefulness of large-scale gene expression data for prediction of gene regulation networks [3-4].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DTMBIO’08, October 30, 2008, Napa Valley, California, USA.
Copyright 2008 ACM 978-1-60558-251-1/08/10...$5.00.
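
As an illustration of the motif concept introduced in Section 1.1, the following minimal Python sketch scans an upstream sequence for occurrences of a motif written with IUPAC nucleotide codes, in which a single letter can stand for several alternative bases and so captures the variable positions typical of binding sites. It is not the method proposed in this paper. The function names, the example motif WGATAR, and the upstream sequence are assumptions chosen only for illustration.

```python
import re

# IUPAC nucleotide codes mapped to the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a compiled regular expression.

    The pattern is wrapped in a lookahead so that overlapping
    occurrences are also reported.
    """
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for a non-IUPAC symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("(?=(" + "".join(parts) + "))")


def find_motif(motif, sequence):
    """Return (0-based position, matched site) for every occurrence."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical upstream sequence and motif, for illustration only.
upstream = "TTAGATAAGGCCTGATAGCA"
hits = find_motif("WGATAR", upstream)
for position, site in hits:
    print(position, site)
```

This sketch searches one strand only. A practical scan would normally also search the reverse complement and would often use a position weight matrix rather than an exact consensus pattern.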