Knowledge and Information Systems
https://doi.org/10.1007/s10115-018-1271-1
REGULAR PAPER
A modified content-based evolutionary approach to identify
unsolicited emails
Shrawan Kumar Trivedi¹ · Shubhamoy Dey²
Received: 12 May 2017 / Revised: 7 December 2017 / Accepted: 26 May 2018
© Springer-Verlag London Ltd., part of Springer Nature 2018
Abstract
This computational research seeks to classify emails as unsolicited or legitimate. A modified
version of an existing genetic programming (GP) classifier, termed modified genetic
programming (MGP), is implemented to build an ensemble of classifiers to identify unsolicited
emails. The proposed classifier is assessed using informative features extracted from two
corpora (Enron and SpamAssassin) with the help of the greedy stepwise feature search
method. A comparative study is then performed with other popular classifiers, such as
Bayesian network, naïve Bayes, decision tree, random forest (RF), support vector machine
(SVM), and GP. The results are validated with 20-fold cross-validation and a paired
t test. They show that the proposed classifier performs better in terms of accuracy
and false-positive detection than the other machine learning classifiers tested
in this study. Using different training and testing sets of email files from the Enron
corpus, ensemble-based classifiers, such as boosted SVM, boosted Bayesian, boosted naïve
Bayesian, RF, and the proposed MGP classifier, are tested and compared on all metrics,
including training and testing time. The findings suggest that the MGP classifier with the
greedy stepwise feature search method offers an improvement over alternative methods in
detecting unsolicited emails.
Keywords Modified genetic programming · Machine learning classifiers · Unsolicited
emails · Ensemble · Accuracy · F value · False-positive rate · Training and testing time
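The evaluation metrics named above (accuracy, F value, and false-positive rate) can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this section: unsolicited email is treated as the positive class, the F value is taken to be the balanced F1 score, and the labels "spam" and "ham" as well as all function names are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # unsolicited mail correctly flagged
    fp: int  # legitimate mail wrongly flagged as unsolicited
    tn: int  # legitimate mail correctly passed
    fn: int  # unsolicited mail that slipped through


def confusion(y_true, y_pred, positive="spam"):
    """Count the four outcomes, treating `positive` as the spam label (assumed)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return Confusion(tp, fp, tn, fn)


def accuracy(c):
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def false_positive_rate(c):
    # Share of legitimate mail that was wrongly flagged.
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0


def f_value(c):
    # Assumed here to be F1: the harmonic mean of precision and recall.
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    y_true = ["spam", "spam", "ham", "ham", "ham"]
    y_pred = ["spam", "ham", "spam", "ham", "ham"]
    c = confusion(y_true, y_pred)
    print(accuracy(c), f_value(c), false_positive_rate(c))
```

In spam filtering the false-positive rate is usually watched most closely, since a legitimate message lost to the spam folder tends to cost more than an unsolicited one reaching the inbox.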
1 Introduction
In today’s automated world, information sharing between organizations and their units is
necessary to create a competitive and sustainable business environment. Email is an important
tool for rapid and economical communication; however, spam (unsolicited email) is seen
✉ Shrawan Kumar Trivedi
shrawan@iimisirmaur.ac.in
Shubhamoy Dey
shubhamoy@iimidr.ac.in
¹ Indian Institute of Management Sirmaur, Sirmaur, HP, India
² Indian Institute of Management Indore, Indore, MP, India