A Decentralized and Personalized Spam
Filter Based on Social Computing
Xin Liu, Zhaojun Xin, Leyi Shi, Yao Wang
College of Computer & Communication Engineering
China University of Petroleum (East China)
Qingdao, China
lx@upc.edu.cn, shileyi@upc.edu.cn, xinzhaojun567@sina.com, wangyao@126.com
Abstract—Spam is a pressing problem for email communication
today. Different users may judge differently what counts as
spam, which makes it difficult for an email server to separate
spam from normal emails. We found that users with similar
interests tend to hold similar opinions. In this paper, we
therefore propose a collaborative and personalized spam filter
based on the social network. The key idea is to let users push
spam reports to those social network friends who share similar
interests, which reflects both collaboration and
personalization. Our proposal uses push technology to share a
user’s individual spam knowledge with others via the social
network, harnessing the wisdom of crowds to resist spam. Based
on the interest similarity between users, a user decides
whether to push a spam report to a given friend, so that each
user’s individual interests are taken into consideration. We
integrate an interest-based spam filter with a basic Bayesian
filter to discriminate spam from legitimate emails. Our
evaluation shows that the proposal achieves a significantly
higher accuracy rate than the Bayesian filter.
Keywords- social network; spam filtering; user interest;
interest similarity; push technology
I. INTRODUCTION
Email is so convenient and easy to use that it has become one
of the most important and widespread communication
applications. Spam, however, is the abuse of email to send
unsolicited bulk messages indiscriminately, which is both
annoying and unethical. The volume of spam keeps increasing,
making it a pressing problem for email communication today.
Statistics [1] show that more than 50% of emails were spam
in 2004, up from 8% in 2001. In 2010, the share rose to 89%,
with 262 billion spam messages sent daily [2]. Spam emails
also consume a large amount of storage space in users’
inboxes and of the bandwidth used to deliver them. Some spam
emails carry malicious content that can harm users’ hosts.
Even worse, spammers keep developing new and more efficient
ways to send spam. In response to this threat, a variety of
spam filters have been proposed, aiming to filter out as many
spam emails as possible while keeping the false positive rate
as low as possible.
According to the filtering technology used, existing spam
filtering methods can be broadly classified into source-based
and content-based methods. The source-based method identifies
spam according to information in the email header. A typical
application of this method is the black/white list [3,4]. A
blacklist or whitelist contains a number of addresses, which
can be email addresses, IP addresses or domain names,
considered to be spam sources or legitimate sources,
respectively. If the sender’s address matches an entry in the
blacklist, the email is blocked; if it matches an entry in
the whitelist, the email is let through. This method is easy
to use and responds quickly, but the number of list entries
is limited, so the accuracy rate is not high, and it may also
block legitimate mail, leading to a high false positive rate.
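As a minimal sketch of the black/white list lookup described above, the following Python function checks a sender against two sets of entries. The function name, the return labels and the example addresses are illustrative assumptions and are not taken from [3,4]; a real deployment would also match IP addresses and handle forged headers.

```python
from email.utils import parseaddr


def classify_by_source(from_header, blacklist, whitelist):
    """Return 'block', 'allow' or 'unknown' for the sender in a From header.

    blacklist and whitelist are sets of lowercase email addresses or
    domain names (hypothetical data structures chosen for this sketch).
    """
    # parseaddr splits 'Name <user@host>' into (name, address).
    _, address = parseaddr(from_header)
    address = address.lower()
    domain = address.rpartition("@")[2]

    # The blacklist is checked first, so a sender on both lists is blocked.
    if address in blacklist or domain in blacklist:
        return "block"
    if address in whitelist or domain in whitelist:
        return "allow"
    # Senders on neither list cannot be judged by this method at all.
    return "unknown"
```

The "unknown" branch illustrates the limitation noted above: because the lists are finite, most senders match neither list, and an overly broad blacklist entry (for example a whole domain) blocks legitimate senders from that domain as well.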
The content-based method identifies spam by analyzing the
mail content. The most basic content-based filter is the
static keyword-based filter [5], which maintains a blacklist
of spam keywords. The filter parses each incoming email into
a series of keywords and compares them with the spam keywords
in the blacklist to determine whether the email is spam.
However, spammers can continuously change the characteristics
of spam emails to degrade its performance. The heuristic
filter [6] is similar to the static keyword-based filter. It
parses the content of incoming emails and identifies spam by
applying a set of rules. For instance, we can set a rule that
labels an incoming email as spam if its content contains the
word “drug”. It suffers from the same problem as the static
keyword-based filter, and setting complicated rules is
difficult for the average user. Machine learning filters,
especially the Bayesian filter, are widely used and perform
very well, but the Bayesian filter requires too much user
involvement during the training period and is usually
deployed on the mail server, so it is not personalized.
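The two ends of the content-based family can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the filters of [5] or [6]: the tokenizer, the class name NaiveBayesFilter, the "spam"/"ham" labels and the add-one smoothing are all choices made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_filter(text, spam_keywords):
    """Static keyword filter: spam if any blacklisted keyword occurs."""
    return any(token in spam_keywords for token in tokenize(text))


class NaiveBayesFilter:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Each labeled email is the user involvement the training needs.
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior of the class.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for token in tokens:
            # The extra +1 in the denominator reserves mass for unseen words.
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def is_spam(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")
```

With the keyword set {"drug"}, keyword_filter flags "Cheap DRUG deals" but misses the obfuscated "Cheap dr*g deals", which is the weakness of static keywords noted above. The Bayesian sketch avoids fixed keywords, but it classifies sensibly only after train has been called on labeled examples of both classes.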
All the traditional spam filters share two common problems.
One is that users in the network cannot share their
knowledge about spam with each other. The other is that these
methods rarely take personalization (the user’s individual
interests) into account. As online social networks become
more and more popular, they open a new direction for solving
the spam problem. By October 2012, the number of registered
users on Facebook had exceeded 1 billion. Users
in the social network can share information with their
978-1-4799-0959-9/14/$31.00 ©2014 IEEE 887