Facing the spammers: A very effective approach to avoid junk e-mails

Tiago A. Almeida, Akebo Yamakami

School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13083-852 Campinas, SP, Brazil

Article info

Keywords: Minimum description length; Confidence factors; Spam filter; Text categorization; Machine learning

Abstract

Spam has become an increasingly important problem with a large economic impact on society. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle and confidence factors. The proposed model is fast to construct and incrementally updatable. Furthermore, we have conducted an empirical experiment using three well-known, large and public e-mail databases. The results indicate that the proposed classifier outperforms the state-of-the-art spam filters.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. E-mail is not only used to support conversation but also as a task manager, document delivery system and archive. The downside of this success is the constantly growing volume of e-mail spam we receive.

The spam problem can be quantified in economic terms, since workers waste many hours on it every day. It is not just the time they waste reading spam but also the time they spend deleting those messages.

According to annual reports, the amount of spam is increasing alarmingly. The average number of spam messages sent per day increased from 2.4 billion in 2002¹ to 300 billion in 2010², representing more than 90% of all incoming e-mail.
On a worldwide basis, the total cost of dealing with spam was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2010.

Fortunately, many solutions are being proposed to avoid this “plague”, and one of the most promising is the use of machine learning techniques for automatically filtering e-mail messages (Cormack, 2008). These methods include approaches that are considered top performers in text categorization, such as Rocchio (Joachims, 1997; Schapire, Singer, & Singhal, 1998), Boosting (Carreras & Marquez, 2001), Support Vector Machines (SVM) (Almeida & Yamakami, 2010; Almeida, Yamakami, & Almeida, 2010a; Drucker, Wu, & Vapnik, 1999; Hidalgo, 2002; Kolcz & Alspector, 2001; Ying, Lin, Lee, & Lin, 2010), Collaborative Systems (Lai, Chen, Laih, & Chen, 2009), Concept Drift (Fdez-Riverola, Iglesias, Diaz, Mendez, & Corchado, 2007), Cluster-based Approach (Hsiao & Chang, 2008), Logistic Regression (Goodman & Yih, 2006; Lynam, Cormack, & Cheriton, 2006; Perlich, Provost, & Simonoff, 2003) and Naïve Bayes classifiers (Almeida, Yamakami, & Almeida, 2009, 2010b; Almeida, Almeida, & Yamakami, 2011; Androutsopoulos, Paliouras, & Michelakis, 2004; Guzella & Caminhas, 2009).

A relatively recent method for inductive inference that is still rarely employed in text categorization tasks is the minimum description length principle. It states that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data (Barron, Rissanen, & Yu, 1998; Grünwald, 2005; Rissanen, 1978). Another modern technique is confidence factors (Assis, Yerazunis, Siefkes, & Chhabra, 2006), which were proposed to reduce the noise introduced by features with small counts and to de-emphasize those with low class separation power.

In this paper, we present a novel spam filtering approach that is based on the minimum description length principle (Bratko, Cormack, Filipic, Lynam, & Zupan, 2006) and confidence factors (Assis et al., 2006).
We have conducted an empirical experiment using three well-known, large, and public databases, and the reported results indicate that our approach outperforms currently established spam filters.

A very basic and preliminary version of this work was presented at ACM SAC 2010 (Almeida et al., 2010a). Here, we significantly improve the algorithm and extend its evaluation. First, and most importantly, we add the confidence factors to assist the classifier's prediction. Second, we offer different tokenizer and training methods. Additionally, we use more realistic e-mail collections and different tasks in our experiments. Finally, we compare the proposed filter with the state-of-the-art spam classifiers.

The remainder of this paper is organized as follows: Section 2 presents the main concepts behind the proposed spam filter.

¹ See http://www.spamlaws.com/spam-stats.html
² See www.ciscosystems.cd/en/US/prod/collateral/cisco_2009_asr.pdf

Corresponding author. Tel.: +55 (19) 3521 3846; fax: +55 (19) 3521 3866.
E-mail addresses: tiago@dt.fee.unicamp.br (T.A. Almeida), akebo@dt.fee.unicamp.br (A. Yamakami).

Expert Systems with Applications 39 (2012) 6557–6561
doi:10.1016/j.eswa.2011.12.049