ENTROPY-BASED ANALYSIS OF CHIP-SEQUENCING DATA

Hossein Zare, Mostafa Kaveh
Dept. of ECE, University of Minnesota, Minneapolis, MN
Email: {hossein,mos}@ece.umn.edu

Arkady B. Khodursky
Dept. of BMBB, University of Minnesota, St. Paul, MN
Email: khodu001@umn.edu

THIS WORK WAS SUPPORTED IN PART BY THE UNIVERSITY OF MINNESOTA DISSERTATION FELLOWSHIP

ABSTRACT

ChIP-Sequencing (ChIP-Seq) is an emerging technology for detecting protein-DNA associations and identifying transcription factor binding sites. This technology, which is an alternative to the ChIP-on-chip technique, provides several advantages, including data with higher resolution and quality. In this paper we present a framework for the analysis of ChIP-Seq data in order to identify the targets of a transcription factor and its binding sites. The introduced method employs the relative entropy measure to identify candidate binding regions with high affinity in the genome and then applies a peak-finding algorithm to locate the local peak(s) within each region. We have applied this method to analyze the chromosomal binding patterns of Lrp, a global transcriptional regulator of amino acid metabolism in Escherichia coli.

Index Terms - ChIP Sequencing, DNA Binding Sites, Relative Entropy, Transcription Factor

1. INTRODUCTION

Detecting associations between proteins and DNA signals is an important part of gene regulation studies and is therefore essential for understanding many biological processes and their anomalies. Control of transcription and replication depends on the recognition of specialized DNA sequences, referred to as "binding sites", by regulatory proteins. Over the past few years, in addition to computational motif discovery algorithms such as [1, 2] and pattern matching algorithms [3, 4, 5], high-throughput technologies, namely ChIP-on-chip and ChIP-Sequencing, have emerged to identify transcription factor binding sites. ChIP-on-chip is an experimental technique that uses chromatin immunoprecipitation and microarray technology to identify the binding of proteins to DNA in vivo [6]. ChIP-Sequencing combines chromatin immunoprecipitation with next-generation sequencing technology [7] for the same purpose. Using this technology, millions of short sequence reads are produced and mapped to the whole genome. This output covers the entire genome and, with high redundancy in reads, provides very high quality data. The shortness (25-36 bp) of the sequence reads provides very high resolution in identifying the precise location of enriched DNA fragments, which can be interpreted as binding sites. Thus, ChIP-Sequencing is a promising alternative to ChIP-on-chip that allows identification of transcription factor binding sites, especially in organisms with high genome complexity.

Since the technology is relatively young, there are only a few studies on ChIP-Sequencing data [7, 8]. The algorithms presented in these studies identify peaks in the signal using a global threshold and, depending on the choice of threshold, may have a high false-positive or a high false-negative rate. In [8] an additional ChIP-Seq data set (a pool of non-immunoprecipitated DNA) has been used to adjust the threshold for a fixed false discovery rate (FDR). However, it is not apparent why the mock IP, an immunoprecipitation reaction without antibodies, would generate the relevant background distribution of sequence reads. Therefore, with the growing demand for ChIP-Sequencing, it is necessary to provide a more statistically sound framework for the analysis of ChIP-Seq data.

Here, we present a data analysis approach that takes into account the biological fact that transcription factors, including Lrp, bind with higher affinity to the promoter regions than to the coding regions. Based on this assumption, the data from the coding regions can be treated as background, or null distribution.
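As a rough illustration of treating coding-region data as a null distribution, the sketch below compares the empirical read-count distribution of a window against a background estimated from coding-region positions, using relative entropy (Kullback-Leibler divergence). The function names, the capped histogram bins, the pseudocount smoothing, and all read counts are assumptions made purely for illustration; this is not the procedure used in this paper.

```python
import math
from collections import Counter


def empirical_pmf(counts, max_count, pseudocount=1.0):
    """Smoothed probability of each read-count value 0..max_count.

    Counts above max_count are clipped into the last bin. The pseudocount
    (an illustrative choice) keeps every bin strictly positive so that the
    relative entropy below is always finite.
    """
    tally = Counter(min(c, max_count) for c in counts)
    total = len(counts) + pseudocount * (max_count + 1)
    return [(tally[k] + pseudocount) / total for k in range(max_count + 1)]


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical per-position read counts (made-up numbers, not real data).
coding_counts = [0, 1, 0, 2, 1, 0, 1, 0, 0, 1, 2, 0]  # treated as background
quiet_window = [1, 0, 0, 1, 2, 0]                     # resembles background
enriched_window = [6, 9, 12, 8, 7, 10]                # many more reads

MAX_COUNT = 12
background = empirical_pmf(coding_counts, MAX_COUNT)

for name, window in [("quiet", quiet_window), ("enriched", enriched_window)]:
    d = relative_entropy(empirical_pmf(window, MAX_COUNT), background)
    print(f"{name} window: D(window || background) = {d:.3f} bits")
```

In this toy setting the enriched window diverges more strongly from the background than the quiet window does, which is the kind of contrast a relative-entropy criterion relies on; the choice of window size and of any decision threshold is left open here.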
In this framework, the regions with high affinity binding to the tran-