© 2019 JETIR January 2019, Volume 6, Issue 1 www.jetir.org (ISSN-2349-5162)
JETIR1901I68 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 505
Real-Time Cybersecurity: Leveraging Apache
Spark and Machine Learning for Effective Intrusion
Detection in Azure Cloud Environment
Madhuri Kanojiya¹, Lokesh Chouhan²
¹,²National Institute of Technology,
Hamirpur, Himachal Pradesh, 177005, India
Abstract—Cybersecurity experts predict that the cost of damage from cyber attacks will rise to $9.2 billion in 2019, with a new attack
occurring every few seconds. Managing the vast amount of data generated daily presents a significant challenge for traditional
intrusion detection systems. Protecting sensitive information is a priority for both governments and businesses, emphasizing the need
for a real-time, large-scale, and robust intrusion detection system (IDS). This paper introduces a distributed, fault-tolerant, and scalable
IDS that leverages Apache Spark's Structured Streaming and machine learning capabilities to detect intrusions in real time. The system
is implemented on Microsoft Azure, which offers both processing power and storage capabilities. A decision tree algorithm is
employed to classify incoming data. Using a machine learning dataset as the data source gives insight into the system's
ability to respond to cyber attacks. In experiments, the system achieved an accuracy of 99.95% and processed over 55,175 events
per second on a small cluster.
Index Terms— Intrusion Detection System; Machine Learning; Apache Spark; Structured Streaming; Big Data; Decision Trees;
Microsoft Azure Cloud
I. INTRODUCTION
In 2017, the world faced some of the greatest cyber threats of the internet era. From the WannaCry ransomware
to the Equifax attack and other data breaches at services such as Yahoo and Uber [1], attacks known as "burst attacks"
grew in complexity and frequency. Burst attacks are short and can occur within a small time frame, such as a few
moments. Cisco's Cybersecurity Report shows that 44% of organizations experienced this kind of Distributed Denial of
Service (DDoS) attack in 2017 [2].
The amount of data generated every day exceeds multiple petabytes, and this includes the traces that web users
leave whenever they access a website, a mobile application, or a system [15-18]. These traces, known as "log data", grow enormously
every day, and they are often generated not by one source but by many. The intelligent use of log data
can give an advantage in identifying malicious connections, thereby protecting the system from future attacks. However,
the short time window that attackers exploit can cripple even well-designed systems, because the attack is over quickly
[11-16]. A real-time detection system that can scale to the amount of data being ingested and respond quickly
can provide an edge against this kind of attack [19-22].
To detect and flag abnormal activities, an intrusion detection system (IDS) is used [14-16]. Cloud computing provides
processing power, storage, services, and applications over the network [3]. The most common cloud service
providers are Amazon, Microsoft Azure, and Google Cloud Platform [2]. The use of on-premises and public cloud infrastructure is
growing. Reports from Cisco indicate that security is viewed as a key benefit of hosting networks in the cloud, as it
adds an extra layer of security.
Based on the methods and shortcomings of previous works, we propose a new approach that aims to provide accurate classification of
cyber attacks in real time. The design of the proposed method takes into account current trends in distributed Big
Data tools. To satisfy these requirements, an intrusion detection system is needed that can handle real-time
streaming data. The final product is a real-time Big Data system implemented within a cloud infrastructure and tested against real-world
traffic data using a machine learning algorithm. The remainder of the paper is organized as follows. Section II reviews
research on intrusion detection systems and the cloud. Section III presents the proposed system and
its components. Section IV presents the evaluation metrics and the results of the tested system. Finally, Section V gives conclusions and
outlines possibilities for future work.
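As a rough illustration of the real-time classification idea outlined above, the following minimal sketch groups a stream of events into micro-batches and labels each event with a small hand-written decision tree. It is plain Python rather than Spark code, and it is not the system proposed in this paper: the feature names (conn_count, bytes_sent, duration), the thresholds, and the helper functions are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, float]


def classify(event: Event) -> str:
    """Toy two-level decision tree; features and thresholds are hypothetical."""
    if event["conn_count"] > 100:
        return "attack"
    if event["bytes_sent"] > 50_000 and event["duration"] < 1.0:
        return "attack"
    return "normal"


def micro_batches(stream: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Group an event stream into micro-batches of at most `size` events."""
    batch: List[Event] = []
    for event in stream:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def run(stream: Iterable[Event], batch_size: int = 2) -> List[str]:
    """Classify every event, one micro-batch at a time, preserving order."""
    labels: List[str] = []
    for batch in micro_batches(stream, batch_size):
        labels.extend(classify(event) for event in batch)
    return labels


# Made-up events standing in for incoming network traffic records.
events = [
    {"conn_count": 150, "bytes_sent": 1_000, "duration": 2.0},
    {"conn_count": 3, "bytes_sent": 80_000, "duration": 0.2},
    {"conn_count": 5, "bytes_sent": 1_200, "duration": 3.5},
]
labels = run(events)
print(labels)
```

In a distributed setting, the micro-batching and the model would presumably be handled by the streaming engine and a trained classifier instead of the hand-written rules used here.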
II. LITERATURE REVIEW
Mustapha et al. [4] used Apache Spark and its ML library to analyze the performance of intrusion detection using four
ML algorithms, namely Support Vector Machine, Naive Bayes, Decision Tree, and Random Forest.
Their study shows that Random Forest yields the best performance in terms of accuracy, sensitivity, and specificity. It
is followed by Decision Tree, while Naive Bayes gave the worst accuracy. This work used Apache
Spark, although only with batch processing and no other processing mode to classify data.
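The three measures named above (accuracy, sensitivity, and specificity) can all be derived from the counts of a binary confusion matrix. The sketch below shows the standard definitions in plain Python; the function name and the example counts are made up for illustration and are not results from [4] or from this paper. It assumes both classes are present, so no denominator is zero.

```python
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: attacks correctly flagged      fn: attacks missed
    tn: normal traffic correctly kept  fp: normal traffic wrongly flagged
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),   # true positive rate (recall on attacks)
        "specificity": tn / (tn + fp),   # true negative rate (recall on normal traffic)
    }


# Illustrative counts only (hypothetical, not taken from any experiment).
m = binary_metrics(tp=90, tn=880, fp=20, fn=10)
print(m)
```

Reporting sensitivity and specificity alongside accuracy matters for intrusion detection because attack traffic is often a small fraction of all traffic, so a high accuracy alone can hide a large share of missed attacks.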