Clustered Logistic Regression Algorithm for Flight Delay Prediction

Catur Supriyanto¹, Fauzi Adi Rafrastara², Yani Parti Astuti³, Lisdi Inu Kencana⁴
Department of Information Engineering, Faculty of Computer Science
Universitas Dian Nuswantoro, Semarang, Indonesia
catur.supriyanto@dsn.dinus.ac.id¹, fauziadi@dsn.dinus.ac.id², yanipartiastuti@dsn.dinus.ac.id³, 111201811166@mhs.dinus.ac.id⁴

Abstract—Cluster computing is a branch of High Performance Computing (HPC) that has become increasingly popular and necessary in recent years. Meanwhile, data mining is a growing technology whose benefits are widely recognized. Combining the two yields a powerful platform in which the processing time of data mining is reduced by the aggregate strength of the cluster. Building a cluster is also more cost-efficient than buying a single high-specification machine such as a server or a supercomputer. In this research, we present a simulation of cluster computing in a virtual environment and use it to run a data mining algorithm that predicts flight delays. The aim is to evaluate the performance improvement of logistic regression when the prediction is performed in a cluster environment. The results show that cluster computing significantly accelerates the algorithm compared to standalone mode. In this experiment, using 1 master and 3 worker nodes (with identical hardware specifications), the computation time of logistic regression was reduced by up to 27.03%. Attaching more nodes to the cluster leads to better computational performance of the cluster.

Keywords—Cluster computing, clustered logistic regression, logistic regression, flight delay prediction, PySpark, Apache Spark.

I. INTRODUCTION

One of the hottest topics in computing in recent years is High Performance Computing (HPC) [1].
There are two kinds of HPC: traditional HPC and modern HPC. The best-known example of traditional HPC is the supercomputer. Modern HPC comes in three types: grid, cloud, and cluster computing [2]. Grid computing has advantages in terms of load balancing, reliability, and access to additional storage. Cloud computing offers computing power comparable to that of a supercomputer, along with high resource availability, virtualization, flexibility, and crash recovery. Cluster computing is suitable for computations that need a single system image (SSI), manageability, and high availability. Each of these three types has its own objectives [2], [3], [4]. All of them can process the same data, but they differ in the scale of data they target.

Grid and cloud computing are overkill for simple data mining computations, for which cluster computing is sufficient. Because the dataset used in this experiment is not very large and the computation is fairly simple, cluster computing is a suitable choice. Cluster computing provides supercomputing power on a smaller scale. A cluster can process data faster than a standalone machine, and the more computers are connected to the cluster, the more speed it can deliver.

Data mining tasks are commonly processed in a standalone environment. Mining a large dataset on a standalone computer is inefficient because it takes a long time (hours or even days), depending on how large the dataset is and how capable the hardware is. A solution is needed so that computation time can be reduced significantly at minimum cost [4], [5].

In this research, we first build a cluster environment and then perform a data mining computation on it. The algorithm used is logistic regression, which is applied to predict flight delays. The data come from a public dataset.
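
For readers unfamiliar with the algorithm named above, the following is a minimal single-machine sketch of binary logistic regression trained with batch gradient descent. It is an illustration only, not the implementation used in this work, which runs on Apache Spark through PySpark. The two features, the toy data, the function names, and the learning-rate and epoch values are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one sample: p(delay) - true label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the averaged gradient.
        w = [wj - lr * gj / n_samples for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict_delay(w, b, x, threshold=0.5):
    """Return 1 (delayed) if the predicted probability reaches the threshold."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= threshold else 0


# Hypothetical toy data: [departure delay in hours, distance in 1000 km].
X = [[0.0, 0.5], [0.1, 1.2], [0.2, 0.8], [1.0, 0.6], [1.5, 1.1], [2.0, 0.9]]
y = [0, 0, 0, 1, 1, 1]  # 1 = flight arrived late (made-up labels)

w, b = train_logistic_regression(X, y)
predictions = [predict_delay(w, b, x) for x in X]
```

Comparing `predictions` with `y` gives a quick sanity check of the fit. In a cluster setting, the per-sample gradient loop is the part that can be split across worker nodes, which is why the algorithm is a reasonable candidate for distributed execution.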

Using the same algorithm and dataset, we then compare the computation performance in three scenarios. The first scenario performs the computation in a standalone environment, using a single virtual PC (VPC) with a single-core CPU and 3 GB of RAM. The second scenario uses 3 VPCs, of which 1 serves as the master node and the remaining 2 serve as worker nodes. Each of these VPCs has the same configuration as the standalone one (single-core CPU and 3 GB of RAM). The third scenario uses 4 VPCs, with 1 master node and 3 worker nodes, all configured identically to those of the previous scenarios. The simulation is conducted inside the virtual machine software VirtualBox (virtualbox.org).

To understand the power of cluster computing, we execute the same source code in all three scenarios. In each scenario, we repeat the process 5 times to obtain the average processing time, and we then compare the three averages. This experiment shows clearly how cluster computing accelerates the mining speed of the logistic regression algorithm.

This paper consists of six sections. Section I presents the introduction and background of the study. Section II discusses related research conducted by other researchers.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 19, No. 2, February 2021 110 https://sites.google.com/site/ijcsis/ ISSN 1947-5500