Research Article

Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics

Anjani Ragothaman,1 Sairam Chowdary Boddu,2 Nayong Kim,2 Wei Feinstein,3 Michal Brylinski,2,3 Shantenu Jha,1 and Joohyun Kim2

1 RADICAL, ECE, Rutgers University, New Brunswick, NJ 08901, USA
2 Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA
3 Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA

Correspondence should be addressed to Michal Brylinski; mbrylinski@lsu.edu, Shantenu Jha; shantenu.jha@rutgers.edu and Joohyun Kim; jhkim@cct.lsu.edu

Received 6 March 2014; Accepted 8 May 2014; Published 9 June 2014

Academic Editor: Daniele D’Agostino

Copyright © 2014 Anjani Ragothaman et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

While most computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because the structural information they predict could uncover the underlying function. However, threading tools are generally compute-intensive, and the number of protein sequences from even small genomes such as prokaryotes is large, typically many thousands, which prohibits their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread, a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements.
We present runtime analysis to characterize the computational complexity of eThread and the EC2 infrastructure. Based on the results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation; it is particularly amenable to small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.

1. Introduction

Modern systems biology holds significant promise to accelerate the development of personalized drugs, namely, tailor-made pharmaceuticals adapted to each person’s own genetic makeup. Consequently, it helps transform symptom-based disease diagnosis and treatment into “personalized medicine,” in which effective therapies are selected and optimized for individual patients [1]. This process is facilitated by various experimental high-throughput technologies such as genome sequencing, gene expression profiling, ChIP-chip/ChIP-seq assays, protein-protein interaction screens, and mass spectrometry [2–4]. Complemented by computational and data analytics techniques, these methods allow for the comprehensive investigation of genomes, transcriptomes, proteomes, and metabolomes, with the ultimate goal of performing a global profiling of health and disease in unprecedented detail [5].

High-throughput DNA sequencing, such as Next-Generation Sequencing (NGS) [6–8], is undoubtedly one of the most widely used techniques in systems biology. By providing genome-wide details on gene sequence, organization, variation, and regulation, NGS provides the means to fully comprehend the repertoire of biological processes in a living cell.
Importantly, continuing advances in genome sequencing technologies result in rapidly decreasing costs of experiments, making them affordable for individual researchers as well as small research groups [8]. Nevertheless, the substantial volume of biological data adds computational complexity to downstream analyses, including functional annotation of gene sequences of a donor genome [9]. Consequently,

Hindawi Publishing Corporation, BioMed Research International, Volume 2014, Article ID 348725, 12 pages, http://dx.doi.org/10.1155/2014/348725