978-1-5090-5814-3/17/$31.00 ©2017 IEEE

Spam Filtering Email Classification (SFECM) using Gain and Graph Mining Algorithm

M. K. Chae 1, Abeer Alsadoon 1, P.W.C. Prasad 1, Sasikumaran Sreedharan 2
1 Charles Sturt University Study Centre, Sydney, Australia
2 Department of Computer Engineering, College of Computer Science, King Khalid University, KSA

Abstract— This paper proposes a hybrid spam email classifier that uses a context-based email classification model as its main algorithm, complemented by an information gain calculation to increase spam classification accuracy. The proposed solution consists of three stages: email pre-processing, feature extraction and email classification. Research has found that the LingerIG spam filter is highly effective at separating spam emails from a cluster of homogeneous work emails. The experimental results also showed a spam filtering accuracy of 100%, as recorded by the team of developers at the University of Sydney. The study has shown that implementing the spam filter in the context-based email classification model is feasible. The experiment confirmed that the spam filtering aspect of the context-based classification model can be improved.

Keywords— email classification; graph mining algorithm; spam; email classifier

I. INTRODUCTION

Email is a cost-effective method of communication used in all areas of industry, and the education industry is no exception. The workforce in the education industry spends a fair amount of time in front of a computer following up on emails. This is especially true of jobs that deal with a high volume of emails each day, such as administrators in the education industry. Managing incoming email is critical to many because emails can announce important meetings, work messages, lunches, industry-related information and upcoming events that many cannot afford to miss. Email is also a means of transferring important documents in education agencies.
The documents often contain international students' private information and scanned copies of applications for admission to education institutions such as universities, TAFEs and private colleges. At present, important work-related emails are still found in the spam folder. Therefore, there is still a need to improve the accuracy of email classifiers using new and existing algorithms.

One possible way to improve spam classification is to use LingerIG, a spam filter implemented in 2003 in an email classification system named Linger [1]. This spam filter is based on calculating information gain. The problem with this solution, however, is its accuracy in classifying non-spam emails into folders. Of the many email learners used by Linger, Widrow-Hoff performs best, yet its accuracy is unstable, ranging between 48.50% and 82.40% [1] when classifying emails into folders. Current solutions such as the context-based email classification model [2] have been developed to be better at classifying emails into homogeneous groups.

This paper combines a spam filter that uses information gain calculation with the context-based email classification model, with the aim of raising spam email classification accuracy to 100%. The proposed solution first uses the spam filter to remove all spam emails from the inbox. The context-based email classification model then classifies the remaining emails into several folders.

This paper is organized as follows: Sections I and II present the introduction and literature survey. Section III describes the proposed solution and Section IV presents the results and discussion. The conclusion is given in Section V.

II. LITERATURE SURVEY

A study of the literature on automated email classification found at least four types of approach: the traditional approach, the ontology-based approach, the graph-mining approach and the neural-network approach.
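
To make the information gain calculation mentioned in Section I concrete, the following is a minimal sketch of how the information gain of a single term with respect to the spam/non-spam label could be computed. It is an illustration only and not the LingerIG implementation: the function names `entropy` and `information_gain`, the binary "term present or absent" feature and the four-email toy corpus are all assumptions introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(emails, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs in an email.

    `emails` is a list of token sets (hypothetical pre-processed emails),
    `labels` is the parallel list of class labels.
    """
    with_term = [lab for doc, lab in zip(emails, labels) if term in doc]
    without_term = [lab for doc, lab in zip(emails, labels) if term not in doc]
    total = len(labels)
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy corpus (assumed data, for illustration only).
emails = [
    {"free", "prize", "click"},
    {"free", "offer", "click"},
    {"meeting", "agenda", "student"},
    {"student", "application", "admission"},
]
labels = ["spam", "spam", "ham", "ham"]

# Rank every term by information gain; high-gain terms are candidate features.
vocabulary = {term for doc in emails for term in doc}
ranked = sorted(
    vocabulary,
    key=lambda term: information_gain(emails, labels, term),
    reverse=True,
)
```

In this toy corpus, a term such as "free" that occurs in every spam email and in no other email separates the two classes perfectly and so receives the maximum gain of one bit, whereas a term that appears in only one email receives a lower score. A filter of this kind would typically keep only the highest-ranked terms as features.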
Among the many solutions proposed by other researchers, Linger and the context-based email classification model were notable discoveries.

A. Traditional Approaches to email classification

Text classification algorithms have been adopted in email classification systems [3][4][5]. These include the Naïve Bayes algorithm [4] and the Support Vector Machine [3], which tokenize the email to calculate its similarity to either spam or another useful type of email. An experiment conducted by Alsmadi and Alhami [3] found that removing stop words from emails improves the accuracy of email classification. Jason D. M. Rennie [4] performed email classification using a Naïve Bayes algorithm in an email classification system named ifile. An email classification method named the Three-Phase Tournament method, devised by Sayed et al. [5], has shown very unstable accuracy, ranging from 2% to 95%.

B. Ontology-based Approaches to email classification