Facing the spammers: A very effective approach to avoid junk e-mails

Tiago A. Almeida ⇑, Akebo Yamakami

School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13083-852 Campinas, SP, Brazil

Keywords: Minimum description length; Confidence factors; Spam filter; Text categorization; Machine learning

Abstract

Spam has become an increasingly important problem with a large economic impact on society. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle and confidence factors. The proposed model is fast to construct and incrementally updateable. Furthermore, we have conducted an empirical experiment using three well-known, large and public e-mail databases. The results indicate that the proposed classifier outperforms the state-of-the-art spam filters. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. E-mail is not only used to support conversation but also as a task manager, document delivery system and archive. The downside of this success is the constantly growing volume of e-mail spam we receive.

The problem of spam can be quantified in economic terms, since workers waste many hours every day. It is not just the time they waste reading spam but also the time they spend deleting those messages. According to annual reports, the amount of spam is increasing alarmingly. The average number of spam messages sent per day increased from 2.4 billion in 2002¹ to 300 billion in 2010², representing more than 90% of all incoming e-mail.
On a worldwide basis, the total cost of dealing with spam was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2010.

Fortunately, many solutions are being proposed to avoid this “plague”, and one of the most promising is the use of machine learning techniques for automatically filtering e-mail messages (Cormack, 2008). These methods include approaches that are considered top performers in text categorization, such as Rocchio (Joachims, 1997; Schapire, Singer, & Singhal, 1998), Boosting (Carreras & Marquez, 2001), Support Vector Machines (SVM) (Almeida & Yamakami, 2010; Almeida, Yamakami, & Almeida, 2010a; Drucker, Wu, & Vapnik, 1999; Hidalgo, 2002; Kolcz & Alspector, 2001; Ying, Lin, Lee, & Lin, 2010), Collaborative Systems (Lai, Chen, Laih, & Chen, 2009), Concept Drift (Fdez-Riverola, Iglesias, Diaz, Mendez, & Corchado, 2007), Cluster-based Approach (Hsiao & Chang, 2008), Logistic Regression (Goodman & Yih, 2006; Lynam, Cormack, & Cheriton, 2006; Perlich, Provost, & Simonoff, 2003) and Naïve Bayes classifiers (Almeida, Yamakami, & Almeida, 2009, 2010b; Almeida, Almeida, & Yamakami, 2011; Androutsopoulos, Paliouras, & Michelakis, 2004; Guzella & Caminhas, 2009).

A relatively recent method for inductive inference that is still rarely employed in text categorization tasks is the minimum description length principle. It states that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data (Barron, Rissanen, & Yu, 1998; Grünwald, 2005; Rissanen, 1978). Another modern technique is confidence factors (Assis, Yerazunis, Siefkes, & Chhabra, 2006), which were proposed to reduce the noise introduced by features with small counts and to de-emphasize those with low class separation power.

In this paper, we present a novel spam filtering approach that is based on the minimum description length principle (Bratko, Cormack, Filipic, Lynam, & Zupan, 2006) and confidence factors (Assis et al., 2006).
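To make the minimum description length principle concrete, the following minimal Python sketch assigns a message to the class whose model encodes it in the fewest bits. It is an illustration only and not the filter proposed in this paper: the Laplace-smoothed unigram model, the function names `description_length` and `classify`, and the toy training data are all assumptions introduced here, and confidence factors are not modeled.

```python
import math
from collections import Counter


def description_length(tokens, class_counts, vocab_size):
    """Bits needed to encode `tokens` under a Laplace-smoothed unigram
    model built from one class's token counts (an assumed, simplified model)."""
    total = sum(class_counts.values())
    bits = 0.0
    for tok in tokens:
        # Add-one smoothing keeps unseen tokens from having zero probability.
        p = (class_counts.get(tok, 0) + 1) / (total + vocab_size)
        bits += -math.log2(p)
    return bits


def classify(message, training):
    """Return the label with the shortest description length, plus all lengths."""
    tokens = message.lower().split()
    # Shared vocabulary: every training token plus the tokens of the message.
    vocab = {t for counts in training.values() for t in counts} | set(tokens)
    lengths = {
        label: description_length(tokens, counts, len(vocab))
        for label, counts in training.items()
    }
    return min(lengths, key=lengths.get), lengths


# Toy training data, invented purely for illustration.
training = {
    "spam": Counter("cheap pills buy cheap pills now win money now".split()),
    "ham": Counter(
        "meeting agenda attached please review the report before the meeting".split()
    ),
}

label, lengths = classify("buy cheap pills", training)
print(label, {k: round(v, 2) for k, v in lengths.items()})
```

In this sketch, a message built from tokens that are frequent in one class compresses better under that class's model, which is the intuition behind choosing the explanation that permits the greatest compression.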
We have conducted an empirical experiment using three well-known, large, and public databases, and the reported results indicate that our approach outperforms currently established spam filters.

A very basic and preliminary version of this work was presented at ACM SAC 2010 (Almeida et al., 2010a). Here, we significantly improve the algorithm and extend its evaluation. First, and most important, we add the confidence factors to assist the classifier's prediction. Second, we offer different tokenizer and training methods. Additionally, we use more realistic e-mail collections and different tasks in our experiments. Finally, we compare the proposed filter with the state-of-the-art spam classifiers.

The remainder of this paper is organized as follows: Section 2 presents the main concepts behind the proposed spam filter.

⇑ Corresponding author. Tel.: +55 (19) 3521 3846; fax: +55 (19) 3521 3866. E-mail addresses: tiago@dt.fee.unicamp.br (T.A. Almeida), akebo@dt.fee.unicamp.br (A. Yamakami).
¹ See http://www.spamlaws.com/spam-stats.html
² See www.ciscosystems.cd/en/US/prod/collateral/cisco_2009_asr.pdf

Expert Systems with Applications 39 (2012) 6557–6561. doi:10.1016/j.eswa.2011.12.049
Journal homepage: www.elsevier.com/locate/eswa