Vol. 24 no. 11 2008, pages 1325–1331
BIOINFORMATICS REVIEW
doi:10.1093/bioinformatics/btn198

Gene expression

Eukaryotic transcription factor binding sites—modeling and integrative search methods

Sridhar Hannenhalli*
Penn Center for Bioinformatics and Department of Genetics, University of Pennsylvania, Philadelphia, USA

Received on February 11, 2008; revised and accepted on April 18, 2008
Advance Access publication April 21, 2008
Associate Editor: Jonathan Wren

ABSTRACT
A comprehensive knowledge of transcription factor binding sites (TFBS) is important for a mechanistic understanding of transcriptional regulation as well as for inferring gene regulatory networks. Because the DNA motif recognized by a transcription factor is typically short and degenerate, computational approaches for identifying binding sites based only on the sequence motif inevitably suffer from high error rates. Current state-of-the-art techniques for improving computational identification of binding sites can be broadly categorized into two classes: (1) approaches that aim to improve binding motif models by extracting maximal sequence information from experimentally determined binding sites and (2) approaches that supplement binding motif models with additional genomic or other attributes (such as evolutionary conservation). In this review we will discuss recent attempts to improve computational identification of TFBS through these two types of approaches and conclude with thoughts on future development.
Contact: sridharh@pcbi.upenn.edu

1 INTRODUCTION
A substantial portion of a cell’s morphological and functional attributes is determined at the level of gene transcription. Thus, a comprehensive mechanistic understanding of transcriptional regulation is an important long-term goal.
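The abstract's point that a short, degenerate motif leads to high error rates can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from this review: it assumes a uniform, independent background (each base with probability 1/4), describes the motif as an IUPAC consensus string, and uses a hypothetical 7-bp consensus and a round sequence length of 3 billion bases purely for illustration. The function names are likewise illustrative.

```python
from math import prod

# Number of bases (out of A, C, G, T) matched by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus: str) -> float:
    """Probability that a random window matches the consensus.

    Assumes a uniform, independent background (an illustrative
    simplification; real genomes are neither uniform nor independent).
    """
    return prod(IUPAC_DEGENERACY[code] / 4 for code in consensus.upper())


def expected_chance_matches(consensus: str, sequence_length: int,
                            strands: int = 1) -> float:
    """Expected number of matches arising by chance alone.

    `strands=2` roughly accounts for scanning both strands; it
    over-counts for palindromic motifs, so the default is 1.
    """
    windows = max(sequence_length - len(consensus) + 1, 0)
    return windows * strands * match_probability(consensus)


if __name__ == "__main__":
    # Hypothetical 7-bp consensus with one two-fold degenerate position,
    # scanned against a hypothetical sequence of 3 billion bases.
    motif = "TGASTCA"
    print(motif, expected_chance_matches(motif, 3_000_000_000))
```

Under these assumptions, a 7-bp consensus with one two-fold degenerate position is expected to match a few hundred thousand times by chance in a sequence of that length, which illustrates why sequence motifs alone produce so many false positives.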
Eukaryotic protein-coding genes are transcribed by RNA polymerase II; however, basal transcription is tightly regulated by complex processes involving chromatin-modifying proteins, transcription factors (TFs), co-factors and RNA polymerase (Wasserman and Sandelin, 2004). A critical component of transcription control relies on sequence-specific binding of multiple TFs to short (13 bps on average) DNA sites in the relative vicinity of the target gene (Kadonaga, 2004). Mutations in transcription factor binding sites (TFBS) are known to underlie several human diseases and are also likely to underlie a substantial component of the phenotypic variability within and across species (Wray, 2007). A comprehensive knowledge of TFBS is thus critical for understanding the mechanism of transcriptional regulation, disease etiology and phenotypic variability.

Genome-scale identification of TFBS involves three main steps: (1) experimentally identifying binding sites, (2) constructing a model or a motif to represent the set of binding sites for a TF and (3) searching for novel instances of binding sites using the model. Additionally, binding sites for an unknown TF can be identified computationally through de novo motif discovery.

1.1 Experimental identification of binding sites
A variety of experimental techniques have been used to identify specific genomic regions bound by a TF. We refer the reader to Elnitski et al. (2006) for a detailed review of these techniques and provide only a brief summary below. Genomic regions that are hypersensitive to the DNase I enzyme (DNase I HS regions) represent open chromatin regions likely to harbor functional TFBS. A number of experimental techniques exist for determining DNase I HS regions, with resolution ranging from a few hundred bases to a single nucleotide.
Given a DNase I HS region, several follow-up experiments can be done to define the precise boundaries of TFBS, such as DNase I protection or footprinting assays and deletion/mutation experiments (the so-called ‘promoter bashing’). Although useful for the discovery of binding sites, these experiments do not identify the associated TFs. Several other techniques have been used extensively for that purpose. In vitro techniques include the Electro-Mobility Shift Assay (EMSA), Systematic Evolution of Ligands by EXponential enrichment (SELEX) and protein-binding DNA microarrays. The most common high-throughput technique for in vivo identification of binding sites for a specific TF is chromatin immunoprecipitation of bound DNA followed either by hybridization (ChIP-chip) or sequencing (ChIP-seq). Elnitski et al. (2006) provide a detailed review of, and references for, the various experimental techniques.

Experimentally determined binding sites are compiled in databases such as TRANSFAC (Matys et al., 2006) and JASPAR (Sandelin et al., 2004). TRANSFAC is a licensed database which currently includes 900 positional weight matrices (PWMs) constructed from published, experimentally determined binding sites for individual TFs. The individual binding sites are assigned a quality score corresponding to the strength of experimental evidence. JASPAR is a freely accessible resource which currently includes 138 non-redundant PWMs, also constructed from literature data but based on a more stringent set of criteria.

*To whom correspondence should be addressed.
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Despite the