© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162)
JETIR1901I68 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 505

Real-Time Cybersecurity: Leveraging Apache Spark and Machine Learning for Effective Intrusion Detection in Azure Cloud Environment

Madhuri Kanojiya 1, Lokesh Chouhan 2
1,2 National Institute of Technology, Hamirpur, Himachal Pradesh, 177005, India

Abstract—Cybersecurity experts predict that the cost of damage from cyber attacks will rise to $9.2 billion in 2019, with a new attack occurring every few seconds. The vast amount of data generated daily presents a significant challenge for traditional intrusion detection systems. Protecting sensitive information is a priority for both governments and businesses, which calls for a real-time, large-scale and robust intrusion detection system (IDS). This paper introduces a distributed, fault-tolerant and scalable IDS that uses Apache Spark's Structured Streaming and machine learning capabilities to detect intrusions in real time. The system is implemented on Microsoft Azure, which provides both processing power and storage. A decision tree algorithm is employed to classify incoming data. Using a machine learning dataset as the data source gives a clearer picture of how well the system responds to cyber attacks. Experimental results show an accuracy of 99.95% and a throughput of more than 55,175 events per second on a small cluster.

Index Terms— Intrusion Detection System; Machine Learning; Apache Spark; Structured Streaming; Data; Decision Trees; Microsoft Azure Cloud

I. INTRODUCTION

In 2017, the world faced some of the greatest cyber threats of the internet era. From the WannaCry ransomware to the Equifax attack and data breaches at services such as Yahoo and Uber [1], attacks known as "burst attacks" grew in complexity and frequency.
Burst attacks are short and can occur within a small time frame, sometimes a matter of moments. Cisco's Cybersecurity Report shows that 44% of organizations experienced this kind of Distributed Denial of Service (DDoS) attack in 2017 [2]. The amount of data generated every day exceeds multiple petabytes, and this includes the traces that web users leave whenever they access a website, mobile application or network [15-18]. These traces, known as "log data", grow larger every day because they are generated not by one source but, at times, by many. The intelligent use of log data can give an advantage in identifying malicious connections, thereby protecting the network from future attacks. However, the short time window that attackers exploit can cripple even well-designed systems, because the attack unfolds so quickly [11-16]. A real-time detection system that can scale to the volume of data being ingested and respond quickly can provide an edge against this kind of attack [19-22]. To detect and flag abnormal activities, an intrusion detection system is used [14-16].

Cloud computing provides processing power, storage, services and applications over the network [3]. The most common cloud service providers are Amazon, Microsoft Azure and Google Cloud Platform [2]. The use of on-premises and public cloud infrastructure is growing. Cisco reports that security is regarded as a key benefit of hosting networks in the cloud, as it adds an extra layer of security.

In view of the methods and shortcomings of earlier work, we propose a new approach that aims to provide accurate classification of cyber attacks in real time. The design of the proposed method takes into account current trends in distributed Big Data tools.
To meet these requirements, an intrusion detection system that can handle real-time streaming data is needed. The final product is a real-time Big Data system implemented within a cloud infrastructure and tested against real-world traffic data using a machine learning algorithm.

The remainder of this paper is organized as follows. Section II reviews related research on intrusion detection systems and the cloud. Section III presents the proposed system and its components. Section IV presents the evaluation metrics and the results of the tested system. Finally, Section V gives conclusions and outlines possibilities for future work.

II. LITERATURE REVIEW

Mustapha et al. [4] used Apache Spark and its machine learning library to examine the performance of intrusion detection using four machine learning algorithms, namely supervised learning models, Bayesian network models, Decision Tree and Random Forest. Their study shows that Random Forest yields the best performance in terms of accuracy, sensitivity and specificity, followed by Decision Tree, while the Bayesian network models gave the worst accuracy. That work used Apache Spark, although only with batch processing and no other processing mode to classify data.
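
The three measures reported in that comparison, commonly called accuracy, sensitivity and specificity, can all be derived from a binary confusion matrix. The following is a minimal sketch in plain Python, not the code used in [4] or in this paper. The names `ConfusionMatrix` and `tally`, the label encoding (1 for attack, 0 for normal traffic) and the sample labels are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary attack/normal classifier (hypothetical helper)."""
    tp: int  # attacks correctly flagged as attacks
    fp: int  # normal records wrongly flagged as attacks
    tn: int  # normal records correctly left unflagged
    fn: int  # attacks that were missed

    def accuracy(self) -> float:
        # Share of all records that were classified correctly.
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    def sensitivity(self) -> float:
        # Share of real attacks that were detected (true positive rate).
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0

    def specificity(self) -> float:
        # Share of normal records that were not flagged (true negative rate).
        negatives = self.tn + self.fp
        return self.tn / negatives if negatives else 0.0


def tally(y_true: Sequence[int], y_pred: Sequence[int],
          attack_label: int = 1) -> ConfusionMatrix:
    """Count outcomes, treating `attack_label` as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == attack_label:
            if guess == attack_label:
                tp += 1
            else:
                fn += 1
        else:
            if guess == attack_label:
                fp += 1
            else:
                tn += 1
    return ConfusionMatrix(tp=tp, fp=fp, tn=tn, fn=fn)


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from any experiment.
    truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    cm = tally(truth, preds)
    print(cm, cm.accuracy(), cm.sensitivity(), cm.specificity())
```

Reporting sensitivity and specificity alongside accuracy matters for intrusion detection because attack records are often a small share of traffic, so a classifier can reach high accuracy while still missing many attacks.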