High-Performance Data Management for Genome Sequencing Centers Using Globus Online: A Case Study

Dinanath Sulakhe
Computation Institute, Argonne National Laboratory and University of Chicago
Chicago, IL 60637 USA
sulakhe@mcs.anl.gov

Rajkumar Kettimuthu
Computation Institute, Argonne National Laboratory and University of Chicago
Chicago, IL 60637 USA
kettimut@mcs.anl.gov

Utpal Dave
Computation Institute, Argonne National Laboratory and University of Chicago
Chicago, IL 60637 USA
dave@ci.uchicago.edu

Abstract— Over the past few years, the availability of low-cost next-generation sequencing has revolutionized how life science researchers in the biomedical field investigate the causative factors of diseases. With biomedical researchers getting many of their patients’ DNA and RNA sequenced, sequencing centers now work with hundreds of researchers, each with terabytes to petabytes of data. The unprecedented scale at which high-throughput technologies generate genomic sequence data today requires sophisticated, high-performance methods of data handling and management. For the most part, however, the state of the art is still to ship the data on hard disks. As data volumes reach tens or even hundreds of terabytes, such approaches become increasingly impractical. Data stored on portable media can easily be lost and typically is not readily accessible to all members of a collaboration. In this paper, we discuss the application of Globus Online within a sequencing facility to address the data movement and management challenges that arise from the exponentially increasing amount of data being generated by a rapidly growing number of research groups. We also present the unique challenges of applying a Globus Online solution in sequencing center environments and describe how we overcome them.
Index Terms— Globus, Globus Online, GridFTP, sequencing center, data transfer, data management, grid, cloud, next-gen sequencing, translational medicine

I. INTRODUCTION

Today’s research communities in scientific domains such as physics, astronomy, cosmology, and biology are dealing with an unprecedented data deluge [1]. Technological advances in scientific methodologies and instrumentation are generating massive amounts of data that require sophisticated and high-performance computational capabilities. In the biomedical field, for example, the availability of low-cost [2] next-generation and third-generation [3] sequencing in the past few years has encouraged large and small research groups alike to have many of their patients’ DNA and RNA sequenced in order to help improve diagnosis and treatment plans.

Dobyns Laboratory [4], a research group at the University of Washington, Seattle, has sequenced hundreds of its patients in the past year, resulting in tens of terabytes of data [5]. The lab uses various sequencing centers (PerkinElmer, Broad Institute, University of Washington) depending on the type of sequencing required. Currently, most of these sequencing centers send the massive raw sequence data back to the research labs on multiple hard disks, shipped by courier (FedEx) [1]. This process is extremely inefficient.

Small research labs lack the resources and the expertise to use the advanced computational solutions available for data handling, and large sequencing centers require high-performance tools that allow them to handle data for hundreds of their clients or researchers. It is extremely challenging for sequencing centers to manage hundreds of researchers and their data at the scale of petabytes, as well as to implement user access control mechanisms and security.
Both small research labs and large sequencing centers therefore need a robust yet simple and transparent high-performance data management solution that provides data movement among multiple locations, security and authentication integrated with local settings, and flexible access control.

In this paper, we explore the use of Globus Online [6] to address the needs of a large, multiuser research data facility such as a sequencing center. We highlight the data management challenges that a typical sequencing center encounters, and we discuss how Globus Online can be set up to address them.

The remainder of the paper is organized as follows. Section II provides background on Globus Online. Section III discusses the use cases under consideration. Sections IV and V describe two approaches to addressing the data movement and access control issues for the use cases described in Section III. We explain the advantages of our approach in Section VI. In Section VII we discuss how Globus Online can be used for further sequence analysis, and in Section VIII we outline future work. We conclude in Section IX with a brief summary.