Big-ETL: Extracting-Transforming-Loading Approach for Big Data

M. Bala 1, O. Boussaid 2, and Z. Alimazighi 3
1 Department of Informatics, Saad Dahleb University, Blida 1, Blida, Algeria
2 Department of Informatics and Statistics, University of Lyon 2, Lyon, France
3 Department of Informatics, USTHB, Algiers, Algeria

Abstract - The ETL process (Extracting-Transforming-Loading) is responsible for (E)xtracting data from heterogeneous sources, (T)ransforming them, and finally (L)oading them into a data warehouse (DW). Nowadays, the Internet and Web 2.0 generate data at an increasing rate and thus confront information systems (IS) with the challenge of big data. Data integration systems, and ETL in particular, must be revisited and adapted; the well-known solution is based on data distribution and parallel/distributed processing. Among the dimensions that define the complexity of big data, we focus in this paper on its excessive "volume" in order to ensure good performance for ETL processes. In this context, we propose an original approach called Big-ETL (ETL Approach for Big Data) in which we define ETL functionalities that can easily be run on a cluster of computers with the MapReduce (MR) paradigm. Big-ETL thereby allows parallelizing/distributing ETL at two levels: (i) the ETL process level (coarse granularity level), and (ii) the functionality level (fine granularity level); this further improves ETL performance.

Keywords: Data Warehousing, Extracting-Transforming-Loading, Parallel/distributed processing, Big Data, MapReduce.

1. Introduction

The widespread use of the Internet, Web 2.0, social networks, and digital sensors produces non-traditional data volumes. Indeed, MapReduce (MR) jobs run continuously on Google clusters and process over twenty petabytes of data per day [1].
This data explosion is an opportunity for the emergence of new business applications such as Big Data Analytics (BDA); at the same time, it is a problem given the limited capabilities of machines and traditional applications. These large data are now called "big data" and are characterized by the four "V"s [2]: Volume refers to amounts of data beyond the usual units; Velocity is the speed at which data are generated and must be processed; Variety denotes the diversity of formats and structures; and Veracity relates to data accuracy and reliability. Furthermore, new paradigms have emerged, such as Cloud Computing [3] and MapReduce (MR) [4]. In addition, novel data models have been proposed for very large data storage, such as NoSQL (Not Only SQL) [5]. This paper aims to provide solutions to the problems caused by big data in a decision-support environment. We are particularly interested in the integration of very large data into a data warehouse. We propose a parallel/distributed ETL approach, called Big-ETL (ETL Approach for Big Data), consisting of a set of MR-based ETL functionalities. The solution offered by the research community in this context is to distribute the ETL process over a cluster of computers: each ETL process instance handles a partition of the data source in parallel to improve ETL performance. This solution is defined only at the process level (coarse granularity level) and does not consider the ETL functionalities (fine granularity level), which would allow a deeper understanding of ETL complexity and, therefore, a significant improvement of the ETL process. To the best of our knowledge, Big-ETL is a different and original approach in the data integration field. We first define an ETL process at a very fine level by parallelizing/distributing its core functionalities according to the MR paradigm.
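To illustrate the idea of expressing an ETL functionality in MR style (this is a hypothetical sketch for exposition, not the authors' implementation; the functionality, field names, and data are invented), consider an aggregation applied over partitions of a data source:

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    # Map: each worker transforms the records of its partition
    # into intermediate (key, value) pairs.
    return [(rec["city"], rec["amount"]) for rec in partition]

def shuffle(mapped_outputs):
    # Shuffle: group intermediate values by key across all partitions.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key (here, a sum).
    return {key: sum(values) for key, values in groups.items()}

# Two source partitions; a real cluster would process them in
# parallel, but they are handled sequentially here for clarity.
partitions = [
    [{"city": "Lyon", "amount": 10}, {"city": "Blida", "amount": 5}],
    [{"city": "Lyon", "amount": 7}],
]
result = reduce_phase(shuffle(map(map_phase, partitions)))
print(result)  # {'Lyon': 17, 'Blida': 5}
```

In Big-ETL's terms, running such a map/shuffle/reduce pipeline inside a single functionality corresponds to the fine granularity level, while distributing entire ETL process instances over source partitions corresponds to the coarse granularity level.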
Big-ETL thereby allows parallelization/distribution of the ETL at two levels: (i) the ETL functionality level, and (ii) the ETL process level; this will further improve ETL performance in the face of big data. To validate our Big-ETL approach, we developed a prototype and conducted some experiments. The rest of this paper is structured as follows. Section 2 presents a state of the art in the ETL field, followed by a classification of the ETL approaches proposed in the literature according to the parallelization criterion. Section 3 is devoted to our Big-ETL approach. We present in Section 4 our prototypical implementation and the conducted experiments. We conclude and present our future work in Section 5.

2. Related work

One of the first contributions in the ETL field is [6], a modeling approach based on a non-standard graphical formalism, with ARKTOS II as the implemented framework. It was the first contribution to model an ETL process with all its details at a very fine level, i.e., the attribute level. In [7], the authors proposed a more holistic modeling approach based on UML (Unified Modeling Language), but with less detail on the ETL process compared to [6]. The authors of [8] adopted the BPMN notation (Business Process Model and Notation), a standard notation dedicated to business process modeling. This work was followed by [9], a modeling framework based on a metamodel in an MDD (Model Driven Development) architecture. [7] and [8] are top-down approaches and therefore allow modeling

462 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'15 |