SmartMonit: Real-time Big Data Monitoring System

Umit Demirbaga*†, Ayman Noor*‡, Zhenyu Wen*, Philip James*, Karan Mitra§, Rajiv Ranjan*
*Newcastle University, †Bartin University, ‡Taibah University, §Luleå University of Technology

Abstract—Modern big data processing systems are becoming very complex in terms of scale, concurrency and multi-tenancy. As a result, many failures and performance degradations occur only at run-time and are very difficult to capture. Moreover, some issues are triggered only when particular components are executed. To analyze the root cause of these types of issues, the dependencies of each component must be captured in real time. In this paper, we propose SmartMonit, a real-time big data monitoring system that collects infrastructure information such as the process status of each task. At the same time, we develop a real-time stream processing framework to analyze the coordination between tasks and between infrastructure and tasks. This coordination information is essential for troubleshooting the reasons for failures and performance reduction, especially those propagated from other causes.

Index Terms—Hadoop, MapReduce, Monitoring

I. INTRODUCTION

Big data systems, such as Hadoop and Spark, typically run on large-scale computer clusters. For example, Fuxi [1] is an extended implementation of YARN that is deployed in a cluster with over 5000 nodes and serves hundreds of millions of customers at Alibaba. For these large-scale systems, two key issues cause performance reduction and inefficient resource utilization. The first is task failures caused by diverse software and hardware faults, and the second is unsuitable scheduling policies. Fault detection in big data systems, however, is very hard due to the considerable scale, the distributed environment and the large number of concurrent jobs. State-of-the-art research is not focused on detecting emergent failures.
Emergent failures happen when errors exceed the propagation boundaries during the interaction among hardware and software components, and they can only be identified at run-time [2]. One example is stragglers, or tailing behavior, in Hadoop systems: the slower execution of one job may cause late-timing failures in many other tasks that have strict time constraints related to service-level agreements (SLAs). Moreover, the cluster scheduler uses heuristics that prioritize the important jobs and allocate resources fairly among the jobs [3]. However, these methods ignore the job structure (or dependencies) and schedule jobs onto the available resources without considering that structure at run-time [4].

In order to detect emergent failures or the underlying reasons for performance reduction, we need a comprehensive and consistent monitoring plan that collects information from each individual processing job while storing, maintaining and analyzing very large volumes of monitoring data [5]. Monitoring tools such as Google Cloud Monitoring¹, SPARKLINT² and Datadog³ aim to collect various information such as CPU, memory, disk, available bandwidth and the execution status of each job. However, they cannot perform a real-time multi-resolution analysis that narrows the scope and increases the resolution, thereby pruning unimportant information.

In this paper, we propose a real-time monitoring system that efficiently collects run-time system information, including computing resource utilization and job execution information, and then combines the collected information with the Execution Graph, which is modeled as a directed acyclic graph (DAG). For example, our system is able to capture the execution stages and the dependencies of each job in real time; at the same time, the resource utilization of each job and of its underlying host is monitored as well. The main contributions are summarized as follows.
We develop a big data monitoring system that can efficiently collect comprehensive monitoring information from large-scale computer clusters in real time. At the same time, we process these streaming data and combine the processed data with the Execution Graph of each task, visualizing the interaction in real time.

To demonstrate the effectiveness of our system, we applied it to a small Hadoop cluster deployed on AWS. The above-mentioned monitoring information is collected in real time and visualized in a user-friendly interface. The demonstration is available as a screencast video on YouTube [6].

¹ https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/
² https://github.com/groupon/sparklint
³ https://docs.datadoghq.com/

II. SMARTMONIT SYSTEM OVERVIEW

This section explains the architecture and implementation details of SmartMonit.

A. System Architecture

Fig. 1 shows the three main components of SmartMonit: Information Collection, Computation and Storing and Visualization.

Information Collection. The Information Collection component collects the job and task metrics and the resource utilization of the nodes in a large-scale computer cluster in real time via SmartAgent and Agent. The SmartAgent is deployed on the Master node and collects the specific information of tasks (mappers and reducers), application (job details)