Cloud Data Federation for Scientific Applications

Spiros Koulouzis¹, Dmitry Vasyunin¹, Reginald Cushing¹, Adam Belloum¹, and Marian Bubak¹,²

¹ Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
{S.Koulouzis,Dvasunin,R.S.Cushing,A.S.Z.Belloum,M.T.Bubak}@uva.nl
² Department of Computer Science, AGH Krakow, Poland
bubak@agh.edu.pl

Abstract. Data-intensive scientific research needs storage capabilities that enable efficient data sharing. This is of great importance for many scientific domains such as the Virtual Physiological Human. In this paper, we introduce a solution that federates a variety of systems, ranging from file servers to more sophisticated ones used in clouds or grids. Our solution follows a client-centric approach that loosely couples a variety of data resources that may use different technologies, such as Openstack-Swift, iRODS, and GridFTP, and may be geographically distributed. It is implemented as a lightweight service that does not require any software to be installed on the resources it uses. In this way we are able to use heterogeneous storage resources efficiently, reduce the complexity of using multiple storage resources, and avoid vendor lock-in in the case of cloud storage. To demonstrate the usability of our approach, we performed a number of experiments that assess the performance and functionality of the developed system.

Keywords: data federation, data sharing, data intensive applications, cloud computing.

1 Introduction

Most in-silico experiments in various scientific domains revolve around massive data volumes. Advances in data capture hardware such as telescopes and sequencing machines mean that data are being generated at unprecedented rates. The experimental sciences alone are producing more data than ever; for example, the LHC produces 15 PB/year [1] and LOFAR [2] is expected to produce 20 PB in the next 5 years [2].
Scientific data are not only growing in size but are also stored around the globe using a variety of storage and access technologies. For this reason, today's research needs advanced storage capabilities that enable collaboration without adding complexity to the way data are accessed and shared [3].

As in many scientific communities, the key challenge within the Virtual Physiological Human (VPH) [4] community is to share and access large datasets that allow the transformation of data to information and information to knowledge.

D. an Mey et al. (Eds.): Euro-Par 2013 Workshops, LNCS 8374, pp. 13–22, 2014. © Springer-Verlag Berlin Heidelberg 2014