978-1-5090-5814-3/17/$31.00 ©2017 IEEE
Spam Filtering Email Classification (SFECM) using
Gain and Graph Mining Algorithm
M. K. Chae¹, Abeer Alsadoon¹, P.W.C. Prasad¹, Sasikumaran Sreedharan²
¹Charles Sturt University Study Centre, Sydney, Australia
²Department of Computer Engineering, College of Computer Science, King Khalid University, KSA
Abstract— This paper proposes a hybrid spam email classifier
that uses a context-based email classification model as its
main algorithm, complemented by an information gain calculation
to increase spam classification accuracy. The proposed solution
consists of three stages: email pre-processing, feature
extraction, and email classification. The research found that
the LingerIG spam filter is highly effective at separating spam
emails from a cluster of homogeneous work emails. The
experimental results also showed a spam filtering accuracy of
100%, as recorded by the team of developers at the University
of Sydney. The study shows that implementing the spam filter in
the context-based email classification model is feasible, and
the experiment confirms that the spam filtering aspect of the
context-based classification model can be improved.
Keywords— email classification; graph mining algorithm;
spam; email classifier
I. INTRODUCTION
Email is a cost-effective method of communication used across
all industries, and the education industry is no exception. Its
workforce spends a fair amount of time at the computer following
up on emails, particularly in roles that handle a high volume of
email each day, such as administrators. Managing incoming email
is critical for many people because emails announce important
meetings, work messages, lunches, industry-related information,
and upcoming events that they cannot afford to miss.
Email is also a means of transferring important documents in
education agencies. These documents often contain international
students' private information and scanned copies of applications
for admission to education institutions such as universities,
TAFEs, and private colleges. At present, important work-related
emails still end up in the spam folder. There is therefore still
a need to improve the accuracy of email classifiers using new
and existing algorithms.
One possible way to improve spam classification is a spam
filter named LingerIG, implemented in 2003 in an email
classification system named Linger [1]. The spam filter is based
on calculating information gain. The problem with this solution,
however, is its accuracy when classifying non-spam emails into
folders. Of the many email learners used by Linger, even the
best, Widrow-Hoff, gives unstable accuracy that ranges from
48.50% to 82.40% [1] when classifying emails into folders. More
recent solutions, such as the context-based email classification
model [2], have been developed to be better at classifying
emails into homogeneous groups.
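To make the information gain idea concrete, the following is a minimal sketch, not the LingerIG implementation itself. It assumes each email is represented as a set of tokens and measures how much knowing whether a term is present reduces the entropy of the spam/non-spam label. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting the corpus on presence of `term`."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    total = len(labels)
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy corpus (hypothetical data): each email is a set of tokens.
emails = [
    {"free", "prize", "click"},
    {"free", "offer"},
    {"meeting", "agenda"},
    {"student", "application", "meeting"},
]
labels = ["spam", "spam", "ham", "ham"]

ig_free = information_gain(emails, labels, "free")    # separates the classes perfectly
ig_click = information_gain(emails, labels, "click")  # appears in only one spam email
```

In this toy corpus "free" occurs in every spam email and in no other email, so it scores higher than "click". A filter of this kind would keep the highest-scoring terms as features.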
This paper aims to combine a spam filter that uses information
gain calculation with the context-based email classification
model, with the goal of raising spam email classification
accuracy to 100%. The proposed solution first uses the spam
filter to remove all spam emails from the inbox. The
context-based email classification model then classifies the
remaining emails into several folders.
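The two-stage flow described above can be sketched as follows. This is only an illustration of the control flow: the two keyword-based functions are hypothetical stand-ins for the information gain spam filter and the context-based model, and the folder names are assumptions.

```python
from typing import Callable, Dict, List


def route_emails(
    emails: List[str],
    is_spam: Callable[[str], bool],
    assign_folder: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Stage 1 removes spam; stage 2 files the remaining emails into folders."""
    folders: Dict[str, List[str]] = {"Spam": []}
    for email in emails:
        if is_spam(email):
            folders["Spam"].append(email)
        else:
            folders.setdefault(assign_folder(email), []).append(email)
    return folders


# Hypothetical stand-ins for the two classifiers (illustration only).
def keyword_spam_filter(email: str) -> bool:
    return "free prize" in email.lower()


def keyword_folder_model(email: str) -> str:
    return "Admissions" if "application" in email.lower() else "General"


inbox = [
    "Claim your FREE PRIZE now",
    "Student application attached",
    "Lunch meeting on Friday",
]
folders = route_emails(inbox, keyword_spam_filter, keyword_folder_model)
```

Because the spam check runs first, the folder classifier only ever sees emails that passed the filter, which is the ordering the proposed solution relies on.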
This paper is organized as follows: Sections I and II present
the introduction and the literature survey. Section III
describes the proposed solution, and Section IV presents the
results and discussion. The conclusion is given in Section V.
II. LITERATURE SURVEY
A study of the literature on automated email classification
found at least four types of approach: the traditional approach,
the ontology-based approach, the graph-mining approach, and the
neural-network approach. Among the many solutions proposed by
other researchers, Linger and the context-based email
classification model were the most notable.
A. Traditional Approaches to email classification
Text classification algorithms have been adopted in email
classification systems [3][4][5]. These include the Naïve Bayes
algorithm [4] and the Support Vector Machine [3], which tokenize
each email and calculate its similarity to either spam or other
useful types of email.
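As a rough illustration of how a token-based classifier of this kind works, the sketch below implements a small multinomial Naïve Bayes with add-one smoothing. It is not the code of any of the cited systems. The function names and training emails are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count documents per class and tokens per class from (tokens, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def classify(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        label_total = sum(token_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip tokens never seen during training
            count = token_counts[label][token]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for illustration.
training = [
    ("free prize click now".split(), "spam"),
    ("free offer click".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("student application attached".split(), "ham"),
]
model = train_naive_bayes(training)
spam_guess = classify(model, "free click".split())
ham_guess = classify(model, "student meeting".split())
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once emails contain more than a handful of tokens.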
Experiments conducted by Alsmadi and Alhami [3] found that
removing stop words from emails improves the accuracy of email
classification. Jason D. M. Rennie [4] performed email
classification using a Naïve Bayes algorithm in an email
classification system named ifile. The Three-Phase Tournament
method devised by Sayed et al. [5] showed very unstable
accuracy, ranging from 2% to 95%.
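The stop-word removal step mentioned above can be sketched in a few lines. The stop-word list here is a small hypothetical sample, not the list used in [3], and the tokenization rule (lowercase runs of letters, digits, and apostrophes) is an assumption.

```python
import re

# Small illustrative stop-word list (an assumption, not taken from the cited work).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "is", "to", "of", "in", "for", "your"}
)


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in stop_words]


tokens = tokenize("The agenda for the meeting is attached to your email")
# tokens == ["agenda", "meeting", "attached", "email"]
```

Dropping high-frequency function words leaves the tokens that actually distinguish one kind of email from another, which is one plausible reason the reported accuracy improves.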
B. Ontology-based Approaches to email classification