D212–D220 Nucleic Acids Research, 2019, Vol. 47, Database issue
Published online 5 November 2018
doi: 10.1093/nar/gky1077

RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12

Alberto Santos-Zavaleta¹, Heladia Salgado¹, Socorro Gama-Castro¹, Mishael Sánchez-Pérez¹, Laura Gómez-Romero¹, Daniela Ledezma-Tejeida¹, Jair Santiago García-Sotelo¹, Kevin Alquicira-Hernández¹, Luis José Muñiz-Rascado¹, Pablo Peña-Loredo¹, Cecilia Ishida-Gutiérrez¹, David A. Velázquez-Ramírez¹, Víctor Del Moral-Chávez¹, César Bonavides-Martínez¹, Carlos-Francisco Méndez-Cruz¹, James Galagan² and Julio Collado-Vides¹,²,*

¹ Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México and ² Department of Biomedical Engineering, Boston University, Boston, MA, USA

Received September 18, 2018; Revised October 16, 2018; Editorial Decision October 18, 2018; Accepted October 19, 2018

ABSTRACT

RegulonDB, first published 20 years ago, is a comprehensive electronic resource about regulation of transcription initiation of Escherichia coli K-12 with decades of knowledge from classic molecular biology experiments, and recently also from high-throughput genomic methodologies. We curated the literature to keep RegulonDB up to date, and initiated curation of ChIP and gSELEX experiments. We estimate that current knowledge describes between 10% and 30% of the expected total number of transcription factor-gene regulatory interactions in E. coli. RegulonDB provides datasets for interactions for which there is no evidence that they affect expression, as well as expression datasets. We developed a proof-of-concept pipeline to merge binding and expression evidence to identify regulatory interactions. These datasets can be visualized in the RegulonDB JBrowse.
We developed the Microbial Conditions Ontology with a controlled vocabulary for the minimal properties needed to reproduce an experiment, which helps integrate data from high-throughput and classic literature. At a higher level of integration, we report Genetic Sensory-Response Units for 200 transcription factors, including their regulation at the metabolic level, and include summaries for 70 of them. Finally, we summarize our research with natural language processing strategies to enhance our biocuration work.

INTRODUCTION

RegulonDB is a database that offers, in an organized and computable form, the knowledge on transcriptional regulation in Escherichia coli K-12 accumulated through decades of experimentation in many different laboratories around the world. It was first published 20 years ago, in 1998 (1), and since then we have periodically published progress reports on our work in database issues of Nucleic Acids Research. Our curation is shared with the EcoCyc database (2), which, together with RegulonDB, provides the major up-to-date resources of organized knowledge for the best-known bacterial genome model organism.

The major avenues of recent progress are the following. We have made important progress in implementing different components that allowed us to expand RegulonDB to include high throughput (HT)-generated knowledge. This included the design and implementation of the Microbial Conditions Ontology (MCO), which provides a formal framework and defines the set of properties necessary to specify the conditions, as well as the genetic material, used in a particular study, so that how an experiment was performed is described well enough to support its reproducibility. This was inspired by suggestions made by Fred Neidhardt years ago (3).
In parallel, we have made progress in the curation of HT-generated literature, particularly binding sites identified from gSELEX and ChIP types of experiments, in conjunction with the corresponding expression profile experiments (4). Here, we report the initial construction of a semi-automatic pipeline that incorporates both binding and expression datasets in order to identify those transcription factor (TF) binding sites (TFBSs) upstream of genes that show change in expression under sim-

* To whom correspondence should be addressed. Tel: +52 777 3132063; Fax: +52 777 3175581; Email: collado@ccg.unam.mx

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com