Computational Prediction of Host-Pathogen
Interactions Through Omics Data Analysis
and Machine Learning
Diogo Manuel Carvalho Leite 1,2(✉), Xavier Brochet 1,2(✉), Grégory Resch 3, Yok-Ai Que 4, Aitana Neves 1,2, and Carlos Peña-Reyes 1,2(✉)
1 School of Business and Engineering Vaud (HEIG-VD), University of Applied Sciences of Western Switzerland (HES-SO), Yverdon-Les-Bains, Switzerland
{diogo.leite,xavier.brochet,carlos.pena}@heig-vd.ch
2 SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
3 Department of Fundamental Microbiology, University of Lausanne, Lausanne, Switzerland
4 Department of Intensive Care Medicine, Bern University Hospital (Inselspital), Bern, Switzerland
Abstract. The emergence and rapid worldwide dissemination of antibiotic
resistance threatens medical progress and calls for innovative approaches to
the management of multidrug-resistant infections. Phage therapy, i.e., the use of
viruses (phages) that specifically infect and kill bacteria during their life cycle, is
a re-emerging and promising alternative for addressing this problem. The success of
phage therapy mainly relies on the exact matching between the target pathogenic
bacteria and the therapeutic phage. Currently, only a few tools or
methodologies efficiently predict phage-bacterium interactions suitable for
phage therapy, and phage-bacterium pairs are therefore tested empirically in
the laboratory. In this paper we present an original methodology, based on an
ensemble-learning approach, to predict whether a given phage-bacterium pair
would interact. Using publicly available information from
GenBank and phagesdb.org, we assembled a dataset containing more than two
thousand phage-bacterium interactions with their corresponding genomes. A set
of informative features, extracted from these genomes, forms the basis of the
quantitative datasets used to train our predictive models. These features include
the distribution of predicted protein-protein interaction scores, as well as the
amino acid frequency, the chemical composition, and the molecular weight of
such proteins. Evaluated on an independent test dataset, our approach achieves
encouraging performance, with accuracy, specificity, and sensitivity all above
90%.
Keywords: Ensemble-learning · Genomic · Machine-learning · Phage-therapy · Protein-protein interaction
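As an illustration of two of the per-protein features named in the abstract (amino acid frequency and molecular weight), a minimal sketch is given below. It is not the authors' pipeline: the function names are hypothetical, the residue masses are standard average values rather than figures taken from this paper, and the chemical-composition and interaction-score features are omitted.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors align.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Standard average residue masses in daltons (assumption: not from the paper).
RESIDUE_MASS = {
    "A": 71.0788, "C": 103.1388, "D": 115.0886, "E": 129.1155,
    "F": 147.1766, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "K": 128.1741, "L": 113.1594, "M": 131.1926, "N": 114.1038,
    "P": 97.1167, "Q": 128.1307, "R": 156.1875, "S": 87.0782,
    "T": 101.1051, "V": 99.1326, "W": 186.2132, "Y": 163.1760,
}
WATER_MASS = 18.01528  # one water molecule per chain (free N- and C-termini)


def _standard_residues(sequence):
    """Return the residues of the sequence that are standard amino acids."""
    residues = [aa for aa in sequence.upper() if aa in RESIDUE_MASS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def amino_acid_frequency(sequence):
    """Relative frequency of each standard amino acid in the sequence."""
    residues = _standard_residues(sequence)
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def molecular_weight(sequence):
    """Approximate average molecular weight of the protein, in daltons."""
    residues = _standard_residues(sequence)
    return sum(RESIDUE_MASS[aa] for aa in residues) + WATER_MASS


def protein_features(sequence):
    """Hypothetical feature vector: 20 frequencies followed by the weight."""
    freqs = amino_acid_frequency(sequence)
    return [freqs[aa] for aa in AMINO_ACIDS] + [molecular_weight(sequence)]
```

Calling `protein_features` on each predicted protein of a phage or bacterial genome would yield fixed-length vectors; how such per-protein values are aggregated per phage-bacterium pair is a design choice this sketch leaves open.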
A. Neves and C. Peña-Reyes contributed equally to this work.
© Springer International Publishing AG 2017
I. Rojas and F. Ortuño (Eds.): IWBBIO 2017, Part II, LNBI 10209, pp. 360–371, 2017.
DOI: 10.1007/978-3-319-56154-7_33