An Approach to Detect Spam Emails by Using Majority Voting
Roohi Hussain
Department of Computer Engineering,
National University of Science and
Technology,
H-12 Islamabad, Pakistan
ABSTRACT
Internet usage has grown intensively over the last
few decades, and with it the use of email, one of
the fastest and cheapest modes of communication.
The growing demand for email communication has
also given rise to spam email, also known as
unsolicited mail. In this paper we propose an
ensemble model that uses majority voting on top of
several classifiers to detect spam. The
classification algorithms used for this purpose are
Naïve Bayesian, Support Vector Machines, Random
Forest, Decision Stump and k-Nearest Neighbor.
Majority voting produces the final decision of the
ensemble by taking the class that receives the most
votes from the classifiers. The sample dataset used
for this task is taken from UCI, and the tool
Rapidminer is used to validate the results.
KEYWORDS
Spam email, filtering, Naïve Bayesian, SVM,
Random Forest, Decision tree, Rapidminer
1 INTRODUCTION
Internet usage has grown intensively over the
last few decades, and with it the use of email,
one of the fastest and cheapest modes of
communication. However, the rise in email and
internet users has resulted in a striking
increase in unsolicited bulk (spam) email. Spam
emails are junk emails that are sent to
numerous undisclosed recipients and that
contain identical messages for everyone.
Usman Qamar
Faculty, Department of Computer Engineering
National University of Science and
Technology,
H-12 Islamabad, Pakistan
A botnet, which is a group of programs
communicating with other similar programs, is
commonly used to send spam emails and is
known for its malicious implications.
The enormous volume of spam affects
Information Technology based businesses and
costs organizations billions of dollars in
lost output [1]. In the last few years, spam
emails have also become a means of intruding
on sensitive data, which poses a serious
threat to the security of many
departments [2].
Researchers have applied classification at
three levels of an email, i.e. the email address,
the subject line and the body contents.
Content-based spam detection is the most
effective of the three. The aim of this paper is
to propose an ensemble that uses a majority
voting approach in combination with filtering
algorithms for spam detection.
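
The majority voting step can be illustrated with a minimal sketch. It is
not the Rapidminer setup used in this paper; the function name
majority_vote, the labels "spam" and "ham", and the per-classifier
predictions below are hypothetical values chosen only for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class label that receives the most votes.

    Assumption: if two labels tie, the label that appears first in
    `votes` wins. With five classifiers and two classes no tie occurs.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # most_common(1) returns a list with the single (label, count) pair
    # that has the highest count.
    return counts.most_common(1)[0][0]


# Hypothetical predictions of the five base classifiers for one email.
predictions = {
    "naive_bayes": "spam",
    "svm": "spam",
    "random_forest": "ham",
    "decision_stump": "spam",
    "knn": "ham",
}

decision = majority_vote(list(predictions.values()))
print(decision)  # spam (3 votes against 2)
```

Using an odd number of base classifiers, as here, means that a
two-class decision always has a strict majority.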
1.1 Spam Features
Spam emails have the following features [3]:
the emails are sent to undisclosed recipients to
advertise services, products or offensive
material. The aim is to deceive innocent people
by obtaining their personal data and abusing
it. The majority of spam emails do not offer an
unsubscribe option.
Proceedings of the International Conference on Data Mining, Internet Computing, and Big Data, Kuala Lumpur, Malaysia, 2014
ISBN: 978-1-941968-02-4 ©2014 SDIWC 76