ISSN 2255-8004
www.bit-journal.eu
Biosystems and Information Technology (2012) Vol.1(1) 14-18
DOI: http://dx.doi.org/10.11592/bit.121102
original scientific article

Application of string similarity ratio and edit distance in automatic metabolite reconciliation comparing reconstructions and models

Martins Mednis 1*, Maike K. Aurich 2

1 Biosystems group, Department of Computer Systems, Latvia University of Agriculture, Liela iela 2, LV3001, Jelgava, Latvia
2 Center for Systems Biology, University of Iceland, Sturlugata 8, IS101, Reykjavik, Iceland
*Corresponding author: Martins.mednis@llu.lv

Received: 14 November 2012; accepted: 27 November 2012; published online: 28 November 2012.
This paper has supplementary material.

Abstract: An increasing number of biochemical network models are becoming available, and reuse of these models is becoming more common. As a consequence, tools to compare models are needed. Comparison can be difficult because model builders often use different standards during reconstruction, metabolite formulas are not always indicated, IDs and names of metabolites differ, and models are stored in different formats (SBML, COBRA and others). Herein, a model comparison algorithm for SBML and COBRA format models, called ModeRator, is presented. A precondition for the correct matching of reactions is the comparison of the participating metabolites. ModeRator is based on the comparison of metabolite names as text strings. An automatic three-level filtering approach is implemented in the software, which rejects pairs of potentially equal metabolites and builds an opinion about metabolite pairs with high similarity in metabolite names. ModeRator was applied to two test cases, comparing two models each of E. coli and S. cerevisiae. Matches of the automatic mapping were manually inspected and compared with the automatic predictions.
Automatic metabolite mapping of the E. coli models (1314 and 1704 metabolites), comparing only identifiers, revealed a high number of matching metabolites. Both models originate from the same source (BioCyc database). No significant difference between automatic mapping and manual curation was observed. For the comparison of the two S. cerevisiae models (679 and 1061 metabolites), three-level filtering by metabolite name was used. The discrepancy between manually curated predictions and ModeRator predictions was 7%.

Keywords: Biochemical networks, reconstruction, models, metabolite mapping, pairwise comparison.

1. Introduction

The function of cells is based on complex networks of interacting chemical reactions carefully organized in space and time. These biochemical reaction networks produce observable cellular functions (Palsson, 2006). Reconstruction of a biochemical network is the assembly of the components and their interconversions that appear within an organism, based on the genome annotation and the bibliome (Lewis et al., 2009). Reconstruction-based models can be used for the analysis of network capabilities, prediction of cellular phenotypes and in silico hypothesis generation (Lewis et al., 2009). The first fully sequenced genome was that of H. influenzae in 1995 (Fleischmann et al., 1995), which enabled the first reconstruction of a genome-scale metabolic network in 1999 (Edwards and Palsson, 1999). Thanks to the accelerating speed of sequencing techniques, it is now possible to reconstruct the network of biochemical reactions in many organisms. The reconstruction process for metabolic networks is well developed (Schellenberger et al., 2011; Thiele and Palsson, 2010) and has been implemented for a number of different organisms (Palsson, 2006; Schellenberger et al., 2010), including human (Duarte et al., 2007).
Several of these networks are available online: Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000; Kanehisa et al., 2011), EcoCyc (Keseler et al., 2011), BioCyc (Karp et al., 2005; Karp et al., 2010) and metaTIGER (Whitaker et al., 2009). The available reconstructions and models are growing both in number and in size (number of interactions within a reconstruction or model). Often multiple reconstructions exist for one organism. The need to compare models or to couple them as parts of larger models has been noted by Radulescu et al. (2008). Manual comparison is very time-consuming, especially in the case of genome-scale reconstructions consisting of thousands of reactions and metabolites. The demand for a method to relate different models has been pointed out elsewhere (Gay et al., 2010).

Genome-size reconstructions are available in different formats. Most are distributed as 1) plain text files, 2) SBML (Systems Biology Markup Language; Hucka et al., 2003) files or 3) spreadsheets (including the COBRA format; Schellenberger et al., 2011). Because models differ in format and in their approach to the standardization of metabolites and reactions, the comparison of reconstructions and models is complicated. Tools exist that enable visual comparison of the model structure (Boele et al., 2012; Kostromins and Stalidzans, 2012; Schellenberger et al., 2010) or comparison of parameters of the structure (Rubina and Stalidzans, 2010; Yamada and Bork, 2009).
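
To illustrate how the two string measures named in the title behave on metabolite names that follow different conventions, a minimal Python sketch is given here. It is not the ModeRator implementation: the function names, the normalization step and the use of the standard-library `difflib` ratio as the similarity measure are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] as computed by difflib (an assumed stand-in
    measure); 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def normalize(name: str) -> str:
    """Hypothetical normalization: trim whitespace and lower-case."""
    return name.strip().lower()


if __name__ == "__main__":
    # Illustrative name pairs, not taken from the compared models.
    pairs = [
        ("D-Glucose", "d-glucose "),
        ("L-Alanine", "D-Alanine"),
        ("pyruvate", "pyruvic acid"),
    ]
    for first, second in pairs:
        x, y = normalize(first), normalize(second)
        print(first, "|", second, "|",
              edit_distance(x, y), "|", round(similarity_ratio(x, y), 2))
```

After normalization, the first pair is identical (edit distance 0), while L-alanine and D-alanine differ by a single character (edit distance 1) although they are distinct stereoisomers. High name similarity therefore suggests candidate pairs but cannot by itself confirm that two metabolites are equal, which is why manual inspection of the automatic matches remains useful.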