Please cite this article as: J. Awiti, A.A. Vaisman and E. Zimányi, Design and implementation of ETL processes using BPMN and relational algebra, Data & Knowledge Engineering (2020) 101837, https://doi.org/10.1016/j.datak.2020.101837.

Design and implementation of ETL processes using BPMN and relational algebra

Judith Awiti a,*, Alejandro A. Vaisman b, Esteban Zimányi a

a Department of Computer and Decision Engineering, Université Libre de Bruxelles, Av. Roosevelt 50, B-1050 Bruxelles, Belgium
b Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina

Keywords: Data Warehousing, OLAP, ETL, BPMN

ABSTRACT

Extraction, transformation, and loading (ETL) processes are used to extract data from internal and external sources of an organization, transform these data, and load them into a data warehouse. The Business Process Model and Notation (BPMN) has been proposed for expressing ETL processes at a conceptual level. A different approach is studied in this paper, where relational algebra (RA), extended with update operations, is used for specifying ETL processes. In this approach, data tasks in an ETL workflow can be automatically translated into SQL queries to be executed over a DBMS. To illustrate this study, the paper addresses the problem of updating Slowly Changing Dimensions (SCDs) with dependencies, that is, the case when updating an SCD table affects associated SCD tables. Tackling this problem requires extending the classic RA with update operations. The paper also shows the implementation of a portion of the TPC-DI benchmark that results from both approaches.
Thus, the paper presents three implementations: (a) an SQL implementation based on the extended RA specification of an ETL process expressed in BPMN4ETL; and (b) two implementations of workflows that follow from BPMN4ETL, one using the Pentaho DI tool and the other using Talend Open Studio for DI. Experiments over these implementations of the TPC-DI benchmark were carried out for different scale factors and are described and discussed in the paper. They show that the extended RA approach results in more efficient processes than those produced by implementing the BPMN4ETL specification over the mentioned ETL tools. The reasons for this result are also discussed.

1. Introduction

Extraction, transformation, and loading (ETL) processes extract data from internal and external sources of an organization, transform these data, and load them into a data warehouse (DW). Since ETL processes are complex and costly, it is important to reduce their development and maintenance costs. Modeling these processes at a conceptual level would contribute to achieving this goal. Since there is no agreed-upon conceptual model for specifying such processes, existing ETL tools use their own specific languages to define ETL workflows. Considering this, the paper discusses two methods for designing ETL processes. The first one, called BPMN4ETL, is based on the Business Process Model and Notation (BPMN), a de facto standard for specifying business processes. It provides a conceptual, implementation-independent specification of such processes, which can then be translated into executable specifications for ETL tools. The second one is a logical model based on relational algebra (RA), a formal language that provides a solid basis for specifying ETL processes for relational databases.
The rationale for studying these two alternatives is twofold. On the one hand, since BPMN is widely used for specifying business processes, adopting this methodology for ETL would likely be smooth for users already familiar with that language. On the other hand, RA is not only a well-studied formal language, but its expressiveness also allows providing a detailed view of the data flow of any ETL process.

* Corresponding author.
E-mail addresses: judith.awiti@ulb.ac.be (J. Awiti), avaisman@itba.edu.ar (A.A. Vaisman), ezimanyi@ulb.ac.be (E. Zimányi).
https://doi.org/10.1016/j.datak.2020.101837
Received 11 December 2019; Received in revised form 20 March 2020; Accepted 11 June 2020; Available online xxxx
0169-023X/© 2020 Elsevier B.V. All rights reserved.