Data Management Challenges in Species Distribution Modeling

Colin Talbert¹, Marian Talbert¹, Jeff Morisette¹, David Koop²
¹ U.S. Geological Survey, Fort Collins, CO
² New York University, NY

Abstract

An important component in the fields of ecology and conservation biology is understanding the environmental conditions and geographic areas that are suitable for a given species to inhabit. A common tool in determining such areas is species distribution modeling, which uses computer algorithms to determine the spatial distribution of organisms. Most commonly, the correlative relationships between the organism and environmental variables are the primary consideration. The data requirements for this type of modeling consist of known presence and possibly absence locations of the species, as well as the values of environmental or climatic covariates thought to define the species' habitat suitability at these locations. These covariate data are generally extracted from remotely sensed imagery, interpolated/gridded historical climate data, or downscaled climate model output. Traditionally, ecologists and biologists have constructed species distribution models using workflows and data that reside primarily on their local workstations or networks. This workflow is becoming challenging as scientists increasingly try to use these modeling techniques to inform management decisions under different climate change scenarios. This challenge stems from the fact that remote sensing products, gridded historical climate data, and downscaled climate models are not only increasing in spatial and temporal resolution but proliferating as well. Any rigorous assessment of uncertainty requires a computationally intensive sensitivity analysis accounting for various sources of uncertainty.
The scientists fitting these models generally do not have the background in computer science required to take advantage of recent advances in web-service-based data acquisition, remote high-powered data processing, or scientific workflow systems. Ecologists in the field of modeling need a tractable platform that abstracts away the computational complexity inherent in the burgeoning field of coupled climate and ecological response modeling. In this paper we describe the computational challenges in species distribution modeling and solutions using scientific workflow systems. We focus on the Software for Assisted Habitat Modeling (SAHM), a package within VisTrails, an open-source scientific workflow system.

Copyright 2013 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

1 Introduction

The objective of this paper is to demonstrate the utility of scientific workflow tools in addressing the increasingly complex data management issues associated with species distribution modeling (SDM). We start with a brief overview of SDM and the currently available software options. We then explain two major factors that are contributing to the increasing modeling complexity. First, there is the growing array of SDM algorithms