Cost-Sensitive Spam Detection
Using Parameters Optimization and Feature Selection
Sang Min Lee
(Department of Computer Engineering, Korea Aerospace University, Seoul, Korea
minuri33@kau.ac.kr)
Dong Seong Kim
(Department of Electrical and Computer Engineering, Duke University, Durham, USA
dongseong.kim@duke.edu)
Jong Sou Park
(Department of Computer Engineering, Korea Aerospace University, Seoul, Korea
jspark@kau.ac.kr)
Abstract: E-mail spam is no longer merely a nuisance but a risk, since it now often carries
virus attachments and spyware agents that can damage recipients’ systems; there is therefore
a growing need for spam detection. Many spam detection techniques based on machine
learning have been proposed. As the amount of spam has increased tremendously through the
use of bulk mailing tools, spam detection techniques must keep pace. To cope with this,
parameters optimization and feature selection have been used to reduce processing overheads
while guaranteeing high detection rates. However, previous approaches have not taken into
account the variable importance of features or the optimal number of features. Moreover, to the best of
our knowledge, there is no approach which uses both parameters optimization and feature
selection together for spam detection. In this paper, we propose a spam detection model
that combines parameters optimization with optimal feature selection. We optimize two
parameters of the detection model, which is built with Random Forests (RF), so as to maximize
the detection rate. We provide the variable importance of each feature so that irrelevant
features are easy to eliminate. Furthermore, we determine the optimal number of selected
features using two methods: (i) a single parameters optimization for the entire feature
selection process and (ii) parameters optimization at every feature elimination phase. Finally, we evaluate our spam
detection model with cost-sensitive measures to avoid misclassification of legitimate messages,
since the cost of classifying a legitimate message as spam far outweighs the cost of
classifying a spam message as legitimate. We perform experiments on the Spambase dataset and
demonstrate the feasibility of our approach.
Keywords: Feature Selection, Intrusion Detection, Parameters Optimization, Random Forests,
Spam Detection, Spambase
Categories: I.2.6, I.5.1, K.6.5, L.4.0
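The cost asymmetry described in the abstract can be illustrated with a small sketch. It uses weighted accuracy, one commonly used cost-sensitive measure, in which each legitimate message counts `cost_ratio` times as much as a spam message. Whether this is among the measures used in the paper is an assumption, and the function name, label encoding (1 = spam, 0 = legitimate), and the value 9 for `cost_ratio` are illustrative choices rather than values taken from the text.

```python
def weighted_accuracy(y_true, y_pred, cost_ratio=9.0):
    """Weighted accuracy with legitimate messages weighted by cost_ratio.

    Labels (assumed encoding): 1 = spam, 0 = legitimate.
    cost_ratio is a hypothetical weight expressing how much worse it is to
    misclassify a legitimate message than to miss a spam message.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    legit_total = sum(1 for t in y_true if t == 0)
    spam_total = len(y_true) - legit_total

    # Correctly classified messages of each class.
    legit_correct = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    spam_correct = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

    denominator = cost_ratio * legit_total + spam_total
    if denominator == 0:
        raise ValueError("no messages to evaluate")
    return (cost_ratio * legit_correct + spam_correct) / denominator


# Two classifiers that each make exactly one mistake on six messages.
y_true = [0, 0, 0, 1, 1, 1]
flags_legit_as_spam = [1, 0, 0, 1, 1, 1]   # one legitimate message blocked
misses_one_spam = [0, 0, 0, 0, 1, 1]       # one spam message let through

score_blocking = weighted_accuracy(y_true, flags_legit_as_spam)
score_missing = weighted_accuracy(y_true, misses_one_spam)
```

Both classifiers have the same plain accuracy, but under this measure the one that blocks a legitimate message scores lower than the one that lets a spam message through, which is the behaviour a cost-sensitive evaluation is meant to capture.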
1 Introduction
Electronic mail (e-mail) is an efficient and increasingly popular communication
method. Concern about the proliferation of unsolicited bulk e-mail, commonly
referred to as “spam”, has been steadily increasing [Cranor and LaMacchia 98]. When
people receive only a small amount of spam, it rarely poses a significant problem.
Journal of Universal Computer Science, vol. 17, no. 6 (2011), 944-960
submitted: 15/5/10, accepted: 30/11/10, appeared: 28/3/11 © J.UCS