A Simple Low Cost Parallel Architecture for Big Data Analytics

Carlos Ordonez (University of Houston§, USA), Sikder Tahsin Al-Amin* (University of Houston§, USA), Xiantian Zhou (University of Houston§, USA)

Abstract—Big Data systems (Hadoop, DBMSs) require complicated setup and tuning to store and process big data on a parallel cluster. This is mainly due to static partitioning when data sets are loaded or copied into the file system. Parallel processing thereafter works in a distributed manner, aiming for balanced parallel execution across nodes. Node synchronization, data redistribution, and distributed caching in main memory are difficult to tune. On the other hand, there exist analytical problems and algorithms that can be computed in parallel with minimal synchronization and fully independent computation. Moreover, some problems can be solved in one pass or a few passes. In this paper, we introduce a low-cost, yet useful, processing architecture in which data sets are dynamically partitioned at run time and storage is transient. Each node processes one partition independently, and partial results are gathered at the master processing node. Surprisingly, we show this architecture works well for some popular machine learning models as well as some graph algorithms. We attempt to identify which problem characteristics enable such efficient processing, and we show that the main bottleneck is the initial data set partitioning and distribution across nodes. We anticipate our architecture can benefit parallel processing in the cloud, where a dynamic number of virtual processors is decided at run time or when the data set is analyzed only for a short time.

Index Terms—Parallel architecture, Big Data, Parallel Processing.

I. INTRODUCTION

Data volumes and processing speeds have both grown significantly over the last two decades. However, data volumes have risen at a much higher rate than processing speeds.
Although there are powerful machines with large memory and disk space, they are costly and may still fail when the data volume is huge. Therefore, processing and analyzing large volumes of data becomes infeasible with a traditional serial approach, and parallel processing emerges to solve the problem. Parallel processing allows a problem to be subdivided into smaller pieces that can be solved faster. Distributing the data across multiple processing units and processing it in parallel yields improved processing speeds [12].

Many abstract models of parallel processing have been introduced for partitioning, processing, and storage. Most approaches start with partitioning: a large data set is partitioned among multiple processing nodes, where each node operates on its assigned partition. Although there are some variants of parallel processing, it is often assumed that the same set of operations must be performed on each processing machine under a shared-nothing architecture. For the output, most models send the partial output to the master node, which combines the partial results to obtain the final result.

In this paper, our contributions are the following: (1) We propose a simple parallel architecture that can be used for parallel processing in big data analytics. (2) Our architecture does not depend on any external, complicated file system; instead, we partition the data dynamically and run on commodity hardware, using the file system "as is". (3) Our architecture is cheap and easy to set up, more machines can be added easily, and there is no need to maintain the partitions.

This is an outline of the rest of this article. Section 2 introduces the definitions and notation used throughout the paper. Section 3 presents our theoretical research contributions, where we present our parallel architecture and how it applies to machine learning problems.

§ Department of Computer Science, University of Houston, Houston TX 77204, USA. * Contact Author: stahsin.cse@gmail.com
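To make the pattern concrete, the following is a minimal, hypothetical sketch of the architecture described above, using Python's standard multiprocessing module as a stand-in for N cluster nodes: the master partitions the data set dynamically at run time, each worker computes on its partition fully independently with no inter-worker synchronization, and the master gathers the partial results. The function names and the per-partition job (a partial sum) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the proposed architecture: dynamic partitioning at run time,
# independent per-partition computation, and gathering at the master.
# multiprocessing.Pool is an illustrative stand-in for N cluster nodes.
from multiprocessing import Pool

def worker(partition):
    # Fully independent computation on one partition; here a placeholder
    # job returning a partial sum and row count for that partition.
    return (sum(partition), len(partition))

def master(dataset, n_workers):
    # Dynamic partitioning at run time: split rows evenly across workers.
    size = (len(dataset) + n_workers - 1) // n_workers
    partitions = [dataset[i:i + size] for i in range(0, len(dataset), size)]
    with Pool(n_workers) as pool:
        partials = pool.map(worker, partitions)  # no inter-worker sync
    # Gather partial results at the master node and combine them.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

if __name__ == "__main__":
    data = list(range(1, 101))
    print(master(data, 4))  # (5050, 100)
```

Note that the workers never communicate with each other; the only coordination points are the initial partitioning and the final gather, which matches the minimal-synchronization setting the architecture targets.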
Section 4 presents an experimental evaluation comparing our solution to state-of-the-art analytic systems. We discuss closely related work in Section 5. Conclusions and directions for future work are given in Section 6.

II. PRELIMINARIES

In this section, we introduce the definitions and symbols used throughout the paper.

A. Input Data Set and Output Solution

We start by defining the input data set as D. Here, D is a matrix with n rows and a number of columns that depends on whether the problem comes from machine learning or graphs. Matrix D can be either dense or sparse. We define the problem solution in a generalized manner as Θ. For machine learning problems, Θ is a model consisting of a list of matrices and associated metrics; for graphs, Θ is generally a vector and associated metrics.

B. Parallel Cluster

We define N as the number of processing nodes (also called workers), where each node has its own CPU and memory (i.e., a shared-nothing architecture) and cannot directly access another node's main memory or storage. There is a separate master node controlling the computation, gathering partial results, and returning the final solution. The heavy-duty work (I/O and CPU computation) is done by the workers, and the master
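As a simple illustration of how a solution Θ can be assembled from partial results in this shared-nothing setting, consider computing the mean and variance of a numeric column: each worker derives small sufficient statistics (row count, linear sum, quadratic sum) from its own partition alone, and the master combines them in one pass. This is an assumed example for exposition, not an algorithm taken from the paper.

```python
# Illustrative one-pass pattern for a shared-nothing cluster: each of the
# N workers computes partial statistics from its own partition only.
def worker_partials(partition):
    n = len(partition)                      # rows in this partition
    s = sum(partition)                      # linear sum
    q = sum(x * x for x in partition)       # quadratic sum
    return n, s, q

# The master gathers the N partial triples and combines them into the
# global model Theta = (mean, variance) without rescanning the data.
def master_combine(partials):
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    q = sum(p[2] for p in partials)
    mean = s / n
    variance = q / n - mean * mean
    return mean, variance
```

Because the partial statistics are tiny compared to the partitions themselves, the gather step at the master is cheap; the dominant cost is the initial partitioning and distribution, consistent with the bottleneck identified in the abstract.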