Time-efficient spam e-mail filtering using n-gram models

Ali Çıltık, Tunga Güngör *

Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey

Received 15 September 2006; received in revised form 6 July 2007
Available online 19 August 2007
Communicated by M.-J. Li

Abstract

In this paper, we propose spam e-mail filtering methods having high accuracies and low time complexities. The methods are based on the n-gram approach and a heuristic referred to as the first n-words heuristic. We develop two models, a class general model and an e-mail specific model, and test the methods under these models. The models are then combined in such a way that the latter is activated in cases where the first model falls short. Though the approach proposed and the methods developed are general and can be applied to any language, we mainly apply them to Turkish, which is an agglutinative language, and examine some properties of the language. Extensive tests were performed, and success rates of about 98% for Turkish and 99% for English were obtained. It has been shown that the time complexities can be reduced significantly without sacrificing performance.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Spam filtering; n-Gram model; Heuristics; Agglutinative language; Free word order; Morphology; Turkish

1. Introduction

Spam e-mail (or junk e-mail) messages are messages that recipients are exposed to without their approval or interest. We may also use the word “unsolicited” for this kind of message, since the concept of spam depends on the person who receives the e-mail. An e-mail that is unsolicited for one person may be regarded as legitimate (normal) by another, and vice versa. In today’s world, where Internet technology is growing rapidly and communication via e-mail is becoming an important part of daily life, spam e-mail messages pose a serious problem.
It is therefore crucial to fight spam messages, which tend to increase exponentially and waste time and resources.

After 1994, some spam prevention tools began to emerge in response to the spammers (people sending spam messages) who started to automate the process of sending spam e-mail. The very first spam prevention tools or filters used a simple approach to language analysis, scanning e-mail messages for suspicious senders or for phrases such as “click here to buy” and “free of charge”. In the late 1990s, blacklisting, whitelisting, and throttling methods were implemented at the Internet Service Provider (ISP) level. However, these methods suffered from maintenance problems. Furthermore, the whitelisting approach is open to forgery.

Some more complex approaches were also proposed against the spam problem. Most of them were implemented using machine learning methods. Naïve Bayes Network algorithms were used frequently and have shown considerable success in filtering English spam messages (Androutsopoulos et al., 2000). Knowledge-based and rule-based systems were also used by researchers for English spam filters (Apte et al., 1994; Cohen, 1996). As an alternative to these classical learning paradigms used frequently in the spam filtering domain, genetic programming was employed (Oda and White, 2003). It required fewer computational resources, making it attractive for spam filtering applications. Case-based reasoning for spam e-mail filtering was discussed in Delany et al. (2005) and Deepak et al. (2006). Metadata were also taken into account in addition to the content of the e-mail by some researchers (Berger et al., 2005).

0167-8655/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2007.07.018
* Corresponding author. Tel.: +90 212 3597094; fax: +90 212 2872461. E-mail address: gungort@boun.edu.tr (T. Güngör).
www.elsevier.com/locate/patrec Available online at www.sciencedirect.com Pattern Recognition Letters 29 (2008) 19–33