Merlin: Metabolic Models Reconstruction using Genome-Scale Information ⋆

Oscar Dias *, Miguel Rocha **, Eugénio C. Ferreira *, Isabel Rocha *

* IBB Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal (e-mail: [odias, ecferreira, irocha]@deb.uminho.pt).
** CCTC - Computer Science and Technology Centre, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal (e-mail: mrocha@di.uminho.pt)

Abstract: This article describes Merlin, a user-friendly program that performs functional genomic annotations of lists of genes. Merlin retrieves information on each homologue and automatically scores the results, allowing the user to change the score selection and dynamically (re-)annotate the genome. Merlin expedites the transition from genome-scale data to SBML metabolic models, allowing the user to have a preliminary view of the biochemical network.

Keywords: Systems Biology, Genome-Scale Reconstruction, BLAST, SBML, Metabolic Engineering.

1. INTRODUCTION

Genome-scale reconstructed metabolic models are based on the well-known stoichiometry of biochemical reactions and can be used for simulating in silico the phenotypic behaviour of a microorganism under different environmental and genetic conditions, thus representing an important tool in Metabolic Engineering [Rocha et al. (2008)]. The reconstruction of a metabolic network associates the genome of a given organism with its physiology, through the replication of the biochemical reactions and molecular mechanisms taking place in that organism [Francke et al. (2005)].

The genome-scale reconstruction of metabolic networks encompasses several steps, such as genome annotation, reaction identification and stoichiometry determination, compartmentation, and determination of the biomass composition, energy requirements and additional constraints.
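As an illustration of what "based on the stoichiometry of biochemical reactions" means in practice, the sketch below encodes a toy three-reaction network as a stoichiometric matrix and checks whether a flux vector is mass balanced. The network, the names `S`, `net_production` and `is_steady_state`, and the steady-state condition S·v = 0 (commonly used in constraint-based modelling, but not stated in the text above) are assumptions made for this example only; it is not part of Merlin.

```python
# Toy network (hypothetical, for illustration only):
#   R1:   -> A   (uptake)
#   R2: A -> B
#   R3: B ->     (secretion)
# Rows are metabolites, columns are reactions; entries are stoichiometric
# coefficients (negative = consumed, positive = produced).
METABOLITES = ["A", "B"]
REACTIONS = ["R1", "R2", "R3"]
S = [
    [1, -1, 0],   # A
    [0, 1, -1],   # B
]


def net_production(S, v):
    """Return S·v: the net production rate of each metabolite for fluxes v."""
    return [sum(coef * flux for coef, flux in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True when no metabolite accumulates or is depleted (S·v = 0)."""
    return all(abs(rate) <= tol for rate in net_production(S, v))


balanced = [2.0, 2.0, 2.0]     # every reaction carries the same flux
unbalanced = [2.0, 1.0, 1.0]   # uptake exceeds conversion, so A accumulates

print(is_steady_state(S, balanced))    # True
print(net_production(S, unbalanced))   # [1.0, 0.0]
```

In a genome-scale model the same matrix has thousands of columns, each of which ideally traces back to an annotated gene, which is why the annotation step matters.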
The first step (genome annotation) is essential to this type of reconstruction, because it yields the preliminary data for the model reconstruction. Information such as gene or open reading frame (ORF) names, assigned cellular functions, sequence similarities and, for the enzyme-coding genes, the Enzyme Commission (EC) number(s) should be retrieved to accomplish the first stage of the mathematical model development [Rocha et al. (2008)].

According to the Integrated Microbial Genomes (IMG) system [Markowitz et al. (2006)], more than 4,000 genomes have been fully sequenced (4,368 as of December 2009) and more than 700 are in draft (747 as of December 2009). Sequence similarities between genes and genomes can be established using well-known algorithms such as BLAST [Altschul et al. (1990)] or FASTA [Lipman and Pearson (1985)].

⋆ This work is supported by a PhD grant from the Portuguese Fundação para a Ciência e a Tecnologia: SFRH/BD/47307/2008.

2. GENOME ANNOTATION

Genome annotation encompasses both "gene finding" on the sequenced genome and the assignment of biological functions to the newly found genes [Medigue and Moszer (2007); Salzberg (2007)].

Gene finding in eukaryotic genomes differs from gene finding in prokaryotic ones: about 90% of a bacterial genome consists of coding sequences, whereas higher eukaryotes have less than 10% of coding sequences. Moreover, eukaryotes generally have two or more overlapping open reading frames, and it is difficult to identify the start of translation and find regulatory signals such as promoters and terminators [Salzberg et al. (1998)].

There are several software tools for gene finding. Almost all use probabilistic methods, such as Hidden Markov Models (HMM), to identify coding sequences within the open reading frames. Examples of such applications are GLIMMER [Salzberg et al. (1998)], GeneMark [Borodovsky and McIninch (1993)] and EuGène [Foissac and Schiex (2005)].
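To make the notion of an open reading frame concrete, the following minimal sketch scans the forward strand of a DNA string for stretches that run from an ATG to the first in-frame stop codon. It is a deliberately naive, deterministic scan and does not reproduce the probabilistic models used by GLIMMER, GeneMark or EuGène; the function name `find_orfs`, the `min_codons` parameter and the example sequence are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons before the stop (start codon included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

A real gene finder must additionally decide which of the candidate ORFs are actually coding, which is where the probabilistic models come in; this sketch only enumerates candidates.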
A few tools, such as Gismo [Krause et al. (2007)], use methods other than HMM. A list of these and other applications is available at www.genefinding.org/software.html.

11th International Symposium on Computer Applications in Biotechnology, Leuven, Belgium, July 7-9, 2010. 978-3-902661-70-8/10/$20.00 © 2010 IFAC. DOI: 10.3182/20100707-3-BE-2012.0076

Some of the software applications listed above also attach biological data (functional annotation) to the recognised genes. Other tools that annotate the genome at the protein level are GOAnno [Chalmel et al. (2005)] and GeneFAS [Joshi et al. (2004)], which uses the Bayesian probability of function similarity between two connected genes and