Zsyntax: A Formal Language for Molecular Biology with Projected Applications in Text Mining and Biological Prediction

Giovanni Boniolo 1,2, Marcello D’Agostino 3, Pier Paolo Di Fiore 1,2,4,*

1 IFOM, Istituto FIRC di Oncologia Molecolare, Milano, Italy
2 Dipartimento di Medicina, Chirurgia ed Odontoiatria, Università di Milano, Milano, Italy
3 Dipartimento di Scienze Umane, Università di Ferrara, Ferrara, Italy
4 Istituto Europeo di Oncologia, Milano, Italy

Abstract

We propose a formal language that allows for transposing biological information precisely and rigorously into machine-readable information. This language, which we call Zsyntax (where Z stands for the Greek word ζωή, life), is grounded on a particular type of non-classical logic, and it can be used to write algorithms and computer programs. We present it as a first step towards a comprehensive formal language for molecular biology in which any biological process can be written and analyzed as a sort of logical “deduction”. Moreover, we illustrate the potential value of this language, both in the field of text mining and in that of biological prediction.

Citation: Boniolo G, D’Agostino M, Di Fiore PP (2010) Zsyntax: A Formal Language for Molecular Biology with Projected Applications in Text Mining and Biological Prediction. PLoS ONE 5(3): e9511. doi:10.1371/journal.pone.0009511

Editor: Mark Isalan, Center for Genomic Regulation, Spain

Received May 22, 2009; Accepted January 27, 2010; Published March 3, 2010

Copyright: © 2010 Boniolo et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Work in the laboratory of PPDF is supported by grants from AIRC (Italian Association for Cancer Research), the European Community, the Monzino Foundation, the Ferrari Foundation and the Cariplo Foundation.
Work by GB and MD’A was supported by the Italian Ministry of University and Research (PRIN 2007). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: pierpaolo.difiore@ifom-ieo-campus.it

Introduction

It is often claimed that biology needs to be formalized (see for instance the Special issue of Science, Mathematics in Biology, of Feb 6th 2004, available at http://www.sciencemag.org/content/vol303/issue5659/index.dtl). In principle, many advantages might be drawn from the implementation of a formal biological language, since formalization ensures non-ambiguity and a degree of precision that cannot be achieved by ordinary language. Indeed, there are numerous excellent examples of the application of mathematics to describe biological systems: take, for instance, graph theory and, in particular, the progress made in the field of scale-free networks [1,2,3,4], or the widespread use of differential equations to describe biological kinetics and dynamics, as any textbook of mathematical biology illustrates [5,6,7]. However, each of these applications is limited to the particular system it aims to describe. That is, fragments of mathematical knowledge are applied according to the particular biological situation under analysis.

The modeling of biochemical systems has also been addressed by drawing on formal methods from computer science, exploiting the analogy between biochemical reactions and computational processes. For example, intensive research has been carried out on extensions and adaptations of the π-calculus, a formalism originally developed for the specification of concurrent processes [8,9,10,11,12,13] that can be used to model biochemical networks as mobile communication systems.
Other groups have focused on developing software environments built on a rule-based syntax that can be interpreted in terms of several reaction models, using techniques from (classical) temporal logic to formalize the properties of these models and to query them [14,15]. Moreover, considerable effort has been devoted to treating the well-known phenomenon of combinatorial explosion, i.e., the fact that the number of distinct states of protein complexes grows exponentially with the number of binding domains and interaction surfaces present in proteins. This has been done either by introducing macrostates, i.e., quantitative indicators of cumulative properties of the system such as levels of occupancy or degrees of phosphorylation [16], or by introducing approximation techniques, such as the layer-based approach [17].

These efforts have greatly improved our ability to model biochemical reactions by means of rigorous mathematical tools, leading to formalisms that are amenable to computer implementation. On the other hand, the formal and mathematical techniques involved, although biologically meaningful, may in some cases prove too difficult for the working biologist to grasp (and to implement). For example, while arguing in favor of the Kappa-calculus, an extension of the π-calculus, Fontana admits that “the reduction of concepts from concurrency to biological practice is neither simple to implement nor easy for biologists to grasp. It deals with unfamiliar concepts, whose clarification took a long time even within their domain of origin” [9].

While we fully recognize the significant advances made in all these research areas, we argue in this paper for a logical approach to biochemical processes, one that exploits the analogy between such processes and logical deductions. We recognize that such an endeavor might meet with the same difficulties encountered by other formalizations, in terms of acceptance and usage by working biologists.
For this reason we have attempted to construct and to propose our formalism in the most biologist-centered way. Since our main objective is to attract the attention of the working biologist, the present exposition aims at providing an informal account of the main ideas underlying the project, while a more detailed formal PLoS ONE | www.plosone.org 1 March 2010 | Volume 5 | Issue 3 | e9511