Full-genomic network inference for non-model organisms: A case study for the fungal pathogen Candida albicans

Jörg Linde, Ekaterina Buyko, Robert Altwasser, Udo Hahn, Reinhard Guthke

Abstract—Reverse engineering of full-genomic interaction networks based on compendia of expression data has been successfully applied to a number of model organisms. This study adapts these approaches to an important non-model organism: the major human fungal pathogen Candida albicans. During the infection process, the pathogen can adapt to a wide range of environmental niches and reversibly changes its growth form. Because these processes are central to infection, it is important to know how they are regulated. This study presents a reverse engineering strategy able to infer full-genomic interaction networks for C. albicans based on linear regression with a sparseness criterion (LASSO). To overcome the limited amount of expression data and the small number of known interactions, we utilize different prior-knowledge sources that guide the network inference towards a knowledge-driven solution. Since no database of known interactions for C. albicans exists, we use a text-mining system which utilizes full-text research papers to identify known regulatory interactions. By comparing with these known regulatory interactions, we find an optimal value for the global modelling parameters that weight the influence of the sparseness criterion and the prior-knowledge. Furthermore, we show that soft integration of prior-knowledge additionally improves the performance. Finally, we compare the performance of our approach to state-of-the-art network inference approaches.

Keywords—Pathogen, Network Inference, Text-Mining, Candida albicans, LASSO, Mutual Information, Reverse Engineering, Linear Regression, Modelling

I. INTRODUCTION

The research community has successfully predicted full-genomic interaction networks of model organisms, such as Escherichia coli [1].
From a methodological point of view, the various full-genomic network inference methods can be divided into approaches based on (partial) correlation [2], information theory [1], [3], [4], and linear regression [5]. Integrating prior-knowledge from additional data sources with gene expression data significantly improves the reverse engineering approach [6]–[8].

J. Linde, R. Altwasser, and R. Guthke are with the Research Group Systems Biology / Bioinformatics at the Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute (HKI), D-07745 Jena, Germany, Email: joerg.linde@hki-jena.de, www.sysbio.hki-jena.de/
Ekaterina Buyko and Udo Hahn are with the Jena University Language and Information Engineering (JULIE) Lab at the Friedrich-Schiller-University, D-07743 Jena, Germany, www.julielab.de/

Understanding the interaction networks of human pathogenic microorganisms is important for the identification of drug targets and the design of medical treatment. So far, only small-scale models for pathogenic bacteria [9] or fungi [8], [10], [11] have been suggested. While for model organisms large databases with known interactions exist [12] and a large amount of expression data is available [13], for C. albicans less expression data is available and only a few interactions are known (and these are not collected in a database). During the last decades, the morbidity and mortality rates due to C. albicans infections have been increasing, making this organism one of the most important human fungal pathogens [14]. The infection process is characterized by a change from a harmless commensal to an aggressive pathogen, phenotypic growth form switches, and adaptations to changing environmental parameters (pH, temperature, nutrient availability, etc.) [15]. All these processes lead to dramatic changes in gene expression patterns [15], [16].
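
The regression-based strategy this study adopts (one sparse linear model per target gene, with prior-knowledge softening the sparseness penalty) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy gene labels, and the specific weighting scheme (scaling the LASSO penalty of a regulator by 1 - prior_weight when the edge is already known) are assumptions made only for illustration, and the expression profiles are assumed to be mean-centered.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, penalty_weights, n_iter=200):
    """Coordinate descent for
    (1/2n) * ||y - X b||^2 + lam * sum_j penalty_weights[j] * |b_j|.
    X is a list of rows (samples); y is a list of target values."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero regulator carries no information
            # Covariance of regulator j with the residual that excludes j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, lam * penalty_weights[j]) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def infer_network(expr, lam, prior, prior_weight):
    """expr: dict gene -> mean-centered expression profile (equal lengths).
    prior: set of (regulator, target) pairs taken as known interactions.
    prior_weight in [0, 1]: 0 ignores the prior, 1 removes the penalty
    on known edges (hypothetical soft-integration scheme).
    Returns a dict (regulator, target) -> non-zero regression coefficient."""
    genes = sorted(expr)
    n_samples = len(expr[genes[0]])
    edges = {}
    for target in genes:
        regulators = [g for g in genes if g != target]
        X = [[expr[g][i] for g in regulators] for i in range(n_samples)]
        weights = [1.0 - prior_weight if (g, target) in prior else 1.0
                   for g in regulators]
        beta = weighted_lasso(X, expr[target], lam, weights)
        for g, b in zip(regulators, beta):
            if b != 0.0:
                edges[(g, target)] = b
    return edges


# Toy usage with three hypothetical genes; C is driven by A, B is unrelated.
toy_expr = {
    "A": [1.0, -1.0, 1.0, -1.0],
    "B": [1.0, 1.0, -1.0, -1.0],
    "C": [2.0, -2.0, 2.0, -2.0],
}
toy_edges = infer_network(toy_expr, lam=0.5, prior={("A", "C")}, prior_weight=0.5)
```

In this sketch, lam controls how sparse the inferred network is and prior_weight controls how strongly known interactions are favoured; these correspond only loosely to the global modelling parameters mentioned in the abstract, whose exact definition is given later in the paper.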
Unraveling the underlying interaction network will improve our understanding of how the pathogen is able to start and maintain infection. Since only a few regulatory interactions are known, it is important to use computer-based models to predict gene interactions in C. albicans. During the last decades, many C. albicans researchers have performed gene expression studies, and a large amount of data has become publicly available [13], facilitating the inference of full-genomic network models.

Presently, human professionals called “biocurators” create and maintain gold-standard databases of scientific knowledge from molecular biology. The curation task is known to be an extremely time-consuming, manual process. As Baumgartner et al. [17] have shown, the exponential growth rate of publications already outpaces the human capability to keep pace with the publication of documents relevant for database curation. Hence, the current state of the art of database curation requires a profound change of methodologies for accessing and structuring information from the biomedical literature. Hahn et al. [18] promote the application of automatic text mining procedures as reasonable support for this task, since considerable progress has been made in this field during the past years. Indeed, the automatic harvesting of information from biomedical literature has attracted considerable attention in recent years, as witnessed by challenges such as BIOCREATIVE (Critical Assessment of Information Extraction systems in Biology) [19] and the BIONLP SHARED TASK ON EVENT EXTRACTION [20] series. The success of text mining tools for automatic database synthesis has already been demonstrated for REGULONDB [12], the world’s largest manually curated reference database for the transcriptional regulation network of E. coli (e.g., [21], [22]).

World Academy of Science, Engineering and Technology 56 2011 224

This study reverse engineers full-genomic gene interaction