SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle

Matthias Boehm 1,2, Iulian Antonov 2, Sebastian Baunsgaard 1, Mark Dokter 2, Robert Ginthör 2, Kevin Innerebner 1, Florijan Klezin 2, Stefanie Lindstaedt 1,2, Arnab Phani 1, Benjamin Rath 1, Berthold Reinwald 3, Shafaq Siddiqi 1, Sebastian Benjamin Wrede 2

1 Graz University of Technology; Graz, Austria
2 Know-Center GmbH; Graz, Austria
3 IBM Research – Almaden; San Jose, CA, USA

Work done while at IT University of Copenhagen, Denmark.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2020. 10th Annual Conference on Innovative Data Systems Research (CIDR '20), January 12-15, 2020, Amsterdam, Netherlands.

ABSTRACT

Machine learning (ML) applications are becoming increasingly common in many domains. The ML systems used to execute these workloads include numerical computing frameworks and libraries, ML algorithm libraries, and specialized systems for deep neural networks and distributed ML. These systems focus primarily on efficient model training and scoring. The data science process, however, is exploratory, and deals with underspecified objectives and a wide variety of heterogeneous data sources. Therefore, additional tools are employed for data engineering and debugging, which requires boundary crossing and unnecessary manual effort, and lacks optimization across the lifecycle. In this paper, we introduce SystemDS, an open source ML system for the end-to-end data science lifecycle, from data integration, cleaning, and preparation, over local, distributed, and federated ML model training, to debugging and serving. To this end, we aim to provide a stack of declarative language abstractions for the different lifecycle tasks and for users with different expertise. We describe the overall system architecture, explain major design decisions (motivated by lessons learned from Apache SystemML), and discuss key features and research directions. Finally, we provide preliminary results that show the potential of end-to-end lifecycle optimization.

1. INTRODUCTION

Machine learning (ML) applications profoundly transform our lives and many domains such as health care, finance, media, transportation, production, and information technology itself. Increased digitalization, sensor-equipped vehicles and production pipelines, feedback loops in data-driven products, and data augmentation also provide large, labeled data collections for training the underlying ML models.

Existing ML Systems: The ML systems used to execute these workloads are still diverse and rapidly evolving, due to the variety of ML algorithms and a lack of standards. Major system categories include numerical computing frameworks like R, Python NumPy [72], or Julia [5], algorithm libraries like Scikit-learn [57] or Spark MLlib [48], large-scale linear algebra systems like Apache SystemML [8] or Mahout Samsara [61], and specialized deep neural network (DNN) frameworks like TensorFlow [1], MXNet [13], or PyTorch [55, 56]. These systems primarily rely on numeric matrices or tensors, and focus on efficient ML training and scoring.

Exploratory Data-Science Lifecycle: In contrast to classical ML problems, the typical data science process is exploratory. Stakeholders pose open-ended problems with underspecified objectives that allow different analytics and can leverage a wide variety of heterogeneous data sources [58]. Data scientists then investigate hypotheses, integrate the necessary data, run different analytics, and look for interesting patterns or models [16].
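A common denominator of these system categories is a linear-algebra-centric programming model, in which models are expressed as matrix expressions over numeric inputs. As a minimal, hypothetical sketch (NumPy is chosen for illustration; the data and variable names are invented, not taken from the paper), ordinary least-squares regression reduces to a handful of such expressions:

```python
import numpy as np

# Hypothetical example data: 100 rows, 3 features, generated from a
# known linear model plus small Gaussian noise.
rng = np.random.default_rng(7)
X = rng.standard_normal((100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(100)

# Closed-form linear regression via the normal equations,
#   w = (X^T X)^{-1} X^T y,
# written as a linear solve for numerical stability.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_true, atol=0.1))  # the fit recovers the true weights
```

In large-scale linear algebra systems like Apache SystemML, essentially the same expressions can be written in an R-like scripting language and compiled to local or distributed runtime plans, rather than executed eagerly as library calls.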
Since the added value is unknown in advance, little investment is made into systematic data acquisition and preparation. This lack of infrastructure results in redundancy of manual effort and computation, especially in small or medium-sized enterprises, which often lack curated catalogs of data and artifacts.

Data Preparation Problem: It is widely recognized that data scientists spend 80-90% of their time finding relevant datasets and performing data integration, cleaning, and preparation tasks [70]. For this reason, many industrial-strength ML applications have dedicated subsystems for data collection, verification, and feature extraction [3, 62, 64]. Since data integration and cleaning are, however, stubbornly difficult tasks to automate [4], existing work primarily focuses on well-defined subproblems or, like Wrangler [37, 59] and Trifacta [29], on semi-manual data wrangling through interactive UIs. Unfortunately, this diversity of tools and specialized algorithms lacks broad systems support, requires boundary crossing, and lacks optimization across the lifecycle. These problems motivated various in-database ML toolkits [14, 22, 31, 46, 54, 66] that enable data preparation and ML training/scoring in SQL. However, this approach was mostly unsuccessful, except for success stories like factorized learning [41, 50, 63], because data scientists perceived in-database ML and array databases [69] as unnatural and cumbersome due to the need for data loading and the verbosity of linear algebra in SQL [66].

A Case for Declarative Data Science: From the viewpoint of a data scientist, it seems most natural to specify data science lifecycle tasks through familiar R or Python