Research Article
Developing eThread Pipeline Using SAGA-Pilot Abstraction for
Large-Scale Structural Bioinformatics
Anjani Ragothaman,1 Sairam Chowdary Boddu,2 Nayong Kim,2 Wei Feinstein,3 Michal Brylinski,2,3 Shantenu Jha,1 and Joohyun Kim2
1 RADICAL, ECE, Rutgers University, New Brunswick, NJ 08901, USA
2 Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA
3 Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
Correspondence should be addressed to Michal Brylinski; mbrylinski@lsu.edu, Shantenu Jha; shantenu.jha@rutgers.edu
and Joohyun Kim; jhkim@cct.lsu.edu
Received 6 March 2014; Accepted 8 May 2014; Published 9 June 2014
Academic Editor: Daniele D’Agostino
Copyright © 2014 Anjani Ragothaman et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive
because the structural information they predict could uncover the underlying function. However, threading tools are generally
compute-intensive, and the number of protein sequences is large even for small genomes such as those of prokaryotes, typically
many thousands, which prohibits their application as a genome-wide structural systems biology tool. To leverage their utility, we
have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources
efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages
large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently
select resources based upon task requirements. We present runtime analysis to characterize the computational complexity of eThread
and the EC2 infrastructure. Based on the results, we suggest a pathway to a solution optimized with respect to metrics such as
time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be
a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly for small genomes such
as those of prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.
1. Introduction
Modern systems biology holds significant promise to accelerate
the development of personalized drugs, namely, tailor-made
pharmaceuticals adapted to each person's own genetic
makeup. Consequently, it helps transform symptom-based
disease diagnosis and treatment into "personalized medicine,"
in which effective therapies are selected and optimized for
individual patients [1]. This process is facilitated by various
experimental high-throughput technologies such as genome
sequencing, gene expression profiling, ChIP-chip/ChIP-seq
assays, protein-protein interaction screens, and mass
spectrometry [2–4]. Complemented by computational and data
analytics techniques, these methods allow for the comprehensive
investigation of genomes, transcriptomes, proteomes,
and metabolomes, with the ultimate goal of profiling health
and disease globally and in unprecedented detail [5].
High-throughput DNA sequencing, such as Next-Generation
Sequencing (NGS) [6–8], is undoubtedly one
of the most widely used techniques in systems biology. By
providing genome-wide details on gene sequence, organization,
variation, and regulation, NGS provides the means to fully
comprehend the repertoire of biological processes in a living
cell. Importantly, continuing advances in genome sequencing
technologies are rapidly decreasing the cost of experiments,
making them affordable for individual researchers as well
as small research groups [8]. Nevertheless, the substantial
volume of biological data adds computational complexity
to downstream analyses, including functional annotation
of the gene sequences of a donor genome [9]. Consequently,
Hindawi Publishing Corporation
BioMed Research International
Volume 2014, Article ID 348725, 12 pages
http://dx.doi.org/10.1155/2014/348725