Proceedings, 10th World Congress of Genetics Applied to Livestock Production

Cyberinfrastructure for Life Sciences - iAnimal Resources for Genomics and Other Data Driven Biology

J.M. Reecy1, J. P. Carson2, F. McCarthy3, J. E. Koltes1, E. Fritz-Waters1, J. Williams4,5, E. Lyons4,6, C. F. Baes7, M. W. Vaughn2,4

1 Department of Animal Science, Iowa State University, Ames, IA
2 Texas Advanced Computing Center, University of Texas, Austin, TX
3 School of Animal and Comparative Biomedical Sciences, University of Arizona, Tucson, Arizona
4 The iPlant Collaborative, Thomas W. Keating Bioresearch Building, University of Arizona, Tucson, AZ
5 Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
6 School of Plant Sciences, University of Arizona, Tucson, AZ
7 Bern University of Applied Sciences, Switzerland

ABSTRACT: Whole genome sequence, SNPs, copy number variation, phenotypes and other “-omics” data underlie evidence-based estimations of breeding value. Unfortunately, the computational resources (data storage, high-performance computing, analysis pipelines, etc.) that exploit this knowledge are limited in availability; many investigations are therefore restricted to the commercial sector or well-funded academic programs. Cyberinfrastructure developed by the iPlant Collaborative (NSF-#DBI0735191) and its extension iAnimal (USDA-#2013-67015-21231) provides the animal breeding community a comprehensive and freely available platform for the storage, sharing, and analysis of large datasets, from genomes to phenotype data. iPlant/iAnimal tools support a variety of genotype-phenotype related analyses in a platform that accommodates every level of user, from breeder to bioinformatician. These tools have been used to develop scalable, accessible versions of common workflows required for applying sequencing to livestock genomics.
Keywords: bioinformatics; breeding; analysis pipeline; high-performance computing; next generation sequencing; variant calling

Introduction

Genomic technologies are now solving previously impossible problems in animal agriculture. High-throughput sequencing and related advances in all areas of information acquisition (phenotypes, climate data, etc.) signal biology’s transition to “Big Data” science. Mapping, sequencing and now the re-sequencing of large numbers of individuals (human, chicken, cattle, swine, sheep, etc.) are feasible projects for individual laboratories, rather than the exclusive domain of international collaborations. 1000-fold reductions in sequencing costs (NHGRI (2014)) make it practical for any lab to become its own genome-sequencing center. In contrast to the increasing accessibility of data, access to computational resources and, perhaps more importantly, to broadly usable interfaces to analysis tools is still lacking; bioinformatics remains a bottleneck (Pérez-Enciso and Ferretti (2010)).

In 2008, the National Science Foundation funded the iPlant Collaborative to develop a national cyberinfrastructure for plant sciences. iPlant serves life science researchers and educators working in all domains of life, enabling them to understand and make increasingly powerful predictions about biological systems. In its first 5 years, iPlant successfully developed a platform of integrated technologies and computational resources that provide access to large replicated data storage, high-performance computing, grid computing, and cloud computing. These resources are made available to scientists at multiple levels, including application programming interfaces (APIs), RESTful services, and web-based systems for data access, tool integration, and analysis. iAnimal is a natural extension of iPlant resources, made broadly available to an animal community already concerned with similar sets of biological questions.
This paper presents the iPlant/iAnimal vision for solving problems in animal breeding/livestock production, and demonstrates a bovine genotyping pipeline as an example of how iPlant/iAnimal resources can be leveraged in investigations relevant to animal science. Beyond specific hardware and software, a key part of the cyberinfrastructure that we are developing is the people concerned both with producing tools suitable for users with various levels of computational background, and with providing training and learning materials that are key to accelerating discovery.

Materials and Methods

iAnimal. Recent advances in biotechnology have permitted animal scientists to sequence all the DNA, or genome, of any organism at relatively little cost. In addition, these technologies are used to understand the activity of genes in a genome and the natural variability in genes between individuals. While such data hold the promise of improving US agriculture by enabling animal breeding and genetics, there exist substantial challenges in transforming all those data into knowledge usable by the widest community of animal scientists and breeders. iAnimal is developing an ecosystem of integrated computational resources that leverage prior national investments in cyberinfrastructure (iPlant Collaborative, CoGe, AgBase, and VCMap) to enable agricultural researchers to accelerate their research towards improving US agriculture. We will develop cyberinfrastructure for researchers to manage, analyze, and visualize their quantitative and functional genomics data through an integrated computational platform of existing resources. Three important aspects of this work will enable scientists to more easily make sense of genomic DNA