Literature Mining of Host-Pathogen Interactions: Comparing Feature-based Supervised Learning and Language-based Approaches
Thanh Thieu 1, Sneha Joshi 2, Samantha Warren 1, and Dmitry Korkin 1,2,3,*
1 Department of Computer Science, University of Missouri, Columbia, MO, USA
2 Informatics Institute, University of Missouri, Columbia, MO, USA
3 Bond Life Science Center, University of Missouri, Columbia, MO, USA
ABSTRACT
Motivation: In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, many of which are specialized to a specific disease or host organism. An accurate and efficient method for the automated extraction of HPIs from biomedical literature is crucial for creating a unified repository of HPI data.
Results: Here, we introduce and compare two new approaches that automatically detect whether the title or abstract of a PubMed publication contains HPI data and extract the information about the organisms and proteins involved in the interaction. The first approach is a feature-based supervised learning method using support vector machines (SVMs). The SVM models are trained on features derived from individual sentences. These features include the names of the host and pathogen organisms and the corresponding proteins or genes, keywords describing HPI-specific information, more general protein-protein interaction information, experimental methods, and other statistical information. The second, language-based approach employs a link grammar parser combined with semantic patterns derived from the training examples. Both approaches have been trained and tested on manually curated HPI data.
When compared to a naïve approach based on an existing protein-protein interaction literature mining method, our approaches demonstrated higher accuracy and recall in the classification task. The most accurate approach, the feature-based one, achieved 73% to 76% accuracy, depending on the test protocol.
Availability: Both approaches are available through the PHILM web server: http://korkinlab.org/philm.html
Contact: korkin@korkinlab.org
Supplementary information: Supplementary data are available at Bioinformatics online.
* To whom correspondence should be addressed.
1 INTRODUCTION
Infections are complex biological processes that target host organisms from virtually all kingdoms of life and involve a variety of microbial pathogens, such as viruses, bacteria, fungi, protozoa, multicellular parasites, and even proteins (Anderson and May, 1979; Mandell and Townsend, 1998). The pathogen's strategy to enter the host organism and breach the host's immune defenses often involves interactions between the host and pathogen proteins (Dyer, et al., 2010; Konig, et al., 2008). Systematic determination and analysis of host-pathogen interactions (HPIs) is a challenging task for both experimental and computational approaches, and it depends critically on previously obtained knowledge about these interactions. For example, several bioinformatics approaches use homology information to predict new HPIs, characterize the interaction structures, or find conservation patterns across HPI networks (Davis, et al., 2007; Dyer, et al., 2007; Dyer, et al., 2010; Franzosa and Xia, 2011). Other approaches, either manual or reliant on existing databases, collect HPI molecular or genetic data into a centralized repository (Driscoll, et al., 2009; Kumar and Nanduri, 2010; Winnenburg, et al., 2008). However, a fully automated system for extracting molecular HPI data directly from the biomedical literature has yet to be built.
Associate Editor: Dr. Jonathan Wren
© The Author (2012). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Bioinformatics Advance Access published January 27, 2012
The rapid growth of published biomedical research has led to the development of a number of methods for biomedical literature mining during the last decade (Krallinger and Valencia, 2005; Rodriguez-Esteban, 2009). The methods dealing with biomolecular information are generally divided into three categories based on the domain of biomedical knowledge they target: (i) automated protein or gene identification in a text (Mika and Rost, 2004; Seki and Mostafa, 2005; Tanabe, et al., 2005), (ii) literature-based functional annotation of genes and proteins (Meinhart and Cramer), and (iii) extraction of information on the relationships between biological molecules, such as proteins, RNAs, or genes (Hu, et al., 2005; Lee, et al., 2008; Shatkay, et al., 2007). The relationships detected by the methods in the third category range from the co-occurrence of genes and proteins in a text (Hoffmann and Valencia, 2005) to protein-protein interactions (PPIs) (Blaschke and Valencia, 2001; Donaldson, et al., 2003; Marcotte, et al., 2001) and to signal transduction networks and metabolic pathways (Friedman, et al., 2001; Hoffmann, et al., 2005; Santos and Eggle, 2005).
Closely related to mining HPIs is the problem of extracting general PPIs from text, which has received increasing attention from the community, and a number of approaches to it have recently been proposed. A basic approach determines an occurrence of a PPI by detecting the co-occurrence of names of the interacting proteins or genes in the