A Decentralized and Personalized Spam Filter Based on Social Computing

Xin Liu, Zhaojun Xin, Leyi Shi, Yao Wang
College of Computer & Communication Engineering
China University of Petroleum (East China)
Qingdao, China
lx@upc.edu.cn, shileyi@upc.edu.cn, xinzhaojun567@sina.com, wangyao@126.com

Abstract—Spam is a pressing problem for email communication today. Different users may hold different views on what counts as spam, which makes it difficult for an email server to separate spam from normal emails. We found that users with similar interests tend to have similar opinions. In this paper, we therefore propose a collaborative and personalized spam filter built on a social network. The key idea is to enable users to push spam reports to those social network friends who share similar interests, which reflects both collaboration and personalization. Our proposal uses push technology to share a user's individual spam knowledge with others via the social network, drawing on the wisdom of crowds to resist spam. A user decides whether to push a spam report to a friend according to the interest similarity between them, so that each user's individual interests are taken into consideration. We integrate an interest-based spam filter with a basic Bayesian filter to discriminate spam from legitimate emails. The evaluation shows that our proposal significantly improves on the Bayesian filter in terms of accuracy rate.

Keywords- social network; spam filtering; user interest; interest similarity; push technology

I. INTRODUCTION

Email is so convenient and easy to use that it has become one of the most important and widespread communication applications. Spam, however, is the illicit use of email to send unsolicited bulk messages indiscriminately, which is both annoying and unethical. The volume of spam keeps increasing, making it a pressing problem for email communication today.
Statistics [1] show that more than 50% of emails were spam in 2004, up from 8% in 2001. In 2010, the share rose to 89%, with 262 billion spam messages sent daily [2]. Spam also takes up a large amount of storage space in users' inboxes and consumes the bandwidth used to deliver it. Some spam sent by malicious users carries malicious content that can harm users' hosts. Even worse, spammers keep developing new and more efficient ways to send spam.

In response to this threat, a variety of spam filters have been proposed that aim to filter out as much spam as possible while keeping the false positive rate as low as possible. According to the filtering technology used, existing spam filtering methods fall mainly into source-based methods and content-based methods.

The source-based method identifies spam according to information in the email header. A typical application is the black/white list [3,4]. The black/white list contains a number of addresses (email addresses, IP addresses, or domain names) that are considered spam sources or legitimate sources, respectively. Once the sender's address matches an entry in the corresponding list, the email is blocked (blacklist) or given a green light (whitelist). This method is easy to use and responds quickly, but the number of list entries is limited, so the accuracy rate is not high, and it may also block legitimate emails, leading to a high false positive rate.

The content-based method identifies spam by analyzing the mail content. The most basic content-based filter is the static keyword-based filter [5], which maintains a blacklist of spam keywords. The filter parses each incoming email into a series of keywords and compares them with the spam keywords in the blacklist to determine whether the email is spam.
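The two filter families described above can be sketched roughly as follows. This is only a minimal illustration, not the implementation of any cited system: the list entries, the keywords, and the function names (`source_verdict`, `keyword_verdict`, `is_spam`) are hypothetical, and real filters would inspect more header fields than the sender address.

```python
import re

# Hypothetical example lists; a real deployment would load these from configuration.
BLACKLIST = {"spammer@example.com", "bulk.example.net"}
WHITELIST = {"alice@example.org"}
SPAM_KEYWORDS = {"lottery", "prize", "drug"}


def source_verdict(sender: str) -> str:
    """Source-based check: match the sender address or its domain against the lists."""
    sender = sender.strip().lower()
    domain = sender.rsplit("@", 1)[-1]
    if sender in WHITELIST or domain in WHITELIST:
        return "accept"
    if sender in BLACKLIST or domain in BLACKLIST:
        return "block"
    return "unknown"


def keyword_verdict(body: str) -> bool:
    """Content-based check: True if any token of the body is a blacklisted keyword."""
    tokens = set(re.findall(r"[a-z0-9']+", body.lower()))
    return not tokens.isdisjoint(SPAM_KEYWORDS)


def is_spam(sender: str, body: str) -> bool:
    """Apply the source-based check first, then fall back to the keyword check."""
    verdict = source_verdict(sender)
    if verdict == "accept":
        return False
    if verdict == "block":
        return True
    return keyword_verdict(body)
```

Because the keyword check relies on exact token matches, a trivially altered spelling such as "l0ttery" passes it unnoticed, which illustrates how fragile static lists are.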
However, spammers can continuously change the characteristics of their emails to degrade the performance of the static keyword-based filter.

The heuristic filter [6] is similar to the static keyword-based filter. It parses the content of incoming emails and identifies spam by applying a series of rules. For instance, we can set a rule that labels an incoming email as spam if its content contains the word “drug”. It suffers from the same problem as the static keyword-based filter, and it is difficult for average users to write complicated rules. Machine learning filters, especially the Bayesian filter, are widely used and perform very well, but they require heavy user involvement during the training period and are usually deployed on the mail server, so they are not personalized.

All the traditional spam filters share two problems. First, users in the network cannot share their knowledge about spam with each other. Second, these methods rarely take personalization (a user's individual interests) into account.

As online social networks become more and more popular, they offer a new direction for solving the spam problem. By October 2012, the number of registered Facebook users had exceeded 1 billion. Users in the social network can share information with their

978-1-4799-0959-9/14/$31.00 ©2014 IEEE