Literature Mining of Host-Pathogen Interactions: Comparing
Feature-based Supervised Learning and Language-based Approaches
Thanh Thieu¹, Sneha Joshi², Samantha Warren¹, and Dmitry Korkin¹,²,³,*
¹ Department of Computer Science, University of Missouri, Columbia, MO, USA
² Informatics Institute, University of Missouri, Columbia, MO, USA
³ Bond Life Science Center, University of Missouri, Columbia, MO, USA
* To whom correspondence should be addressed.
ABSTRACT
Motivation: In an infectious disease, the pathogen’s strategy to
enter the host organism and breach its immune defenses often
involves interactions between the host and pathogen proteins.
Currently, the experimental data on host-pathogen interactions
(HPIs) are scattered across multiple databases, which are often
specialized to target a specific disease or host organism. An
accurate and efficient method for the automated extraction of HPIs
from biomedical literature is crucial for creating a unified repository
of HPI data.
Results: Here, we introduce and compare two new approaches to
automatically detect whether the title or abstract of a PubMed
publication contains HPI data, and extract the information about
organisms and proteins involved in the interaction. The first
approach is a feature-based supervised learning method using
support vector machines (SVMs). The SVM models are trained on
the features derived from the individual sentences. These features
include names of the host/pathogen organisms and corresponding
proteins or genes, keywords describing HPI-specific information,
more general protein-protein interaction information, experimental
methods, and other statistical information. The second, language-based
approach employs a link grammar parser combined with semantic
patterns derived from the training examples. Both approaches were
trained and tested on manually curated HPI data. When
compared to a naïve approach based on an existing protein-protein
interaction literature mining method, our approaches demonstrated
higher accuracy and recall in the classification task. The more
accurate, feature-based approach achieved 73% to 76% accuracy,
depending on the test protocol.
Availability: Both approaches are available through the PHILM web
server: http://korkinlab.org/philm.html
Contact: korkin@korkinlab.org
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Infections are complex biological processes that target host
organisms from virtually all kingdoms of life and involve a variety
of microbial pathogens, such as viruses, bacteria, fungi, protozoa,
multicellular parasites, and even proteins (Anderson and May,
1979; Mandell and Townsend, 1998). The pathogen’s strategy to
enter the host organism and breach the host’s immune defenses
often involves interactions between the host and pathogen proteins
(Dyer, et al., 2010; Konig, et al., 2008). Systematic determination
and analysis of host-pathogen interactions (HPIs) is a
challenging task for both experimental and computational
approaches, and depends critically on previously obtained
knowledge about these interactions. For example, several
bioinformatics approaches use homology information to
predict new HPIs, characterize the interaction structures, or find
conservation patterns across HPI networks (Davis, et al., 2007;
Dyer, et al., 2007; Dyer, et al., 2010; Franzosa and Xia, 2011).
Other approaches, either manual or reliant on the existing
databases, collect the HPI molecular or genetic data into a
centralized repository (Driscoll, et al., 2009; Kumar and Nanduri,
2010; Winnenburg, et al., 2008). However, a fully automated
system for extracting molecular HPI data directly from the
biomedical literature has yet to be built.
The rapid growth of published biomedical research has driven
the development of a number of methods for biomedical literature
mining during the last decade (Krallinger and Valencia, 2005;
Rodriguez-Esteban, 2009). The methods dealing with the
biomolecular information are generally divided into three
categories based on the domain of biomedical knowledge they
target: (i) automated protein or gene identification in a text (Mika
and Rost, 2004; Seki and Mostafa, 2005; Tanabe, et al., 2005), (ii)
literature-based functional annotation of genes and proteins
(Meinhart and Cramer), and (iii) extracting the information on the
relationships between biological molecules, such as proteins and
RNAs, or genes (Hu, et al., 2005; Lee, et al., 2008; Shatkay, et al.,
2007). The relationships detected by the methods from the third
category range from the co-occurrence of genes and proteins in a
text (Hoffmann and Valencia, 2005) to protein-protein
interactions (PPIs) (Blaschke and Valencia, 2001;
Donaldson, et al., 2003; Marcotte, et al., 2001) and
signal transduction networks and metabolic pathways
(Friedman, et al., 2001; Hoffmann, et al., 2005; Santos and Eggle,
2005).
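The simplest relationship mentioned above, co-occurrence of gene or protein names in a text, can be illustrated with a minimal sketch. It is not the method of any of the cited works: the lexicon PROTEIN_NAMES, the naive sentence splitter, and the function names are hypothetical and chosen only for illustration. A practical system would replace the lexicon lookup with a dedicated protein or gene name recognizer.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon (an assumption for this sketch); a real system
# would rely on a protein/gene name recognizer instead of a fixed set.
PROTEIN_NAMES = {"gp120", "CD4", "CCR5", "Tat"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_pairs(text, lexicon=PROTEIN_NAMES):
    """Return (sentence, name_a, name_b) for every pair of lexicon names
    that appear together in the same sentence."""
    hits = []
    for sentence in split_sentences(text):
        # Alphanumeric tokens only, so "HIV-1" becomes "HIV" and "1".
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(tokens & lexicon)
        for name_a, name_b in combinations(found, 2):
            hits.append((sentence, name_a, name_b))
    return hits


if __name__ == "__main__":
    example = "HIV-1 gp120 binds CD4 on T cells. Tat was also studied."
    for sentence, name_a, name_b in cooccurring_pairs(example):
        print(f"{name_a} / {name_b}: {sentence}")
```

Co-occurrence alone does not establish that the two named molecules interact; it only flags candidate sentences.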
The problem of extracting general PPIs from text is closely
related to that of mining HPIs and has received increasing
attention from the community, with a number of approaches
proposed in recent years. A basic
approach determines an occurrence of a PPI by detecting the co-
occurrence of names of the interacting proteins or genes in the
Associate Editor: Dr. Jonathan Wren
© The Author (2012). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Bioinformatics Advance Access published January 27, 2012