Research Article
Fingerprint-Based Near-Duplicate Document Detection with
Applications to SNS Spam Detection
Phuc-Tran Ho¹ and Sung-Ryul Kim²
¹ Department of Advanced Technology Fusion, Konkuk University, Seoul 143-701, Republic of Korea
² Department of Internet & Multimedia Engineering, Konkuk University, Seoul 143-701, Republic of Korea
Correspondence should be addressed to Sung-Ryul Kim; kimsr@konkuk.ac.kr
Received 4 December 2013; Revised 8 April 2014; Accepted 8 April 2014; Published 6 May 2014
Academic Editor: Hoon Ko
Copyright © 2014 P.-T. Ho and S.-R. Kim. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Social networking is used by millions of people around the world. It has become the most popular way for people to
connect and interact online with their friends. Many social networking sites, for instance, Facebook,
My Space, and Twitter, have a huge number of active users. Therefore, they are also attractive places for spammers or cheaters who
want to steal users' personal information or advertise their products. Recently, many methods using different techniques have been
proposed to detect spam comments on social networks. In this paper, we propose a similarity-based method that combines
a fingerprinting technique with a trie-tree data structure and a meet-in-the-middle approach in order to achieve higher accuracy in
spam comment detection. Using our proposed approach, we are able to detect around 98% of the spam comments in our dataset.
1. Introduction
In the last few years, social networking has emerged as
an Internet phenomenon. It has become the main way for
people to connect and keep track of their friends online.
The most popular social networking sites such as Facebook,
Twitter, and My Space are consistently among the top 20
most viewed websites on the Internet. Many people spend
more and more time on their virtual lives on social
networks rather than on their real lives. Moreover, the
personal information that they store and share on such sites
is usually only loosely secured. Hence, social networking is
also a potential target for spammers and cheaters who want
to advertise their products or, more dangerously, steal
users' information. Spammers can achieve their purposes
easily with many simple tricks, for example, posting fake
updates that contain malicious links, abusing the comment
function to post unsolicited messages to users, image
tricks, and social engineering.
Spam comments usually have duplicate or near-duplicate
contents. Therefore, they can be detected by several common
methods that are used to detect duplicate and near-duplicate
documents in the web mining field.
Duplicate and mirror web pages are abundant on the
World Wide Web [1]. Near-duplicate documents, by contrast,
are mostly identical to the original ones but differ in
several small portions of the document such as advertisements,
timestamps, or counters.
Recently, duplicate and near-duplicate document detection
has become important in various computer science fields,
specifically data mining, information retrieval, and web
mining. Its advantage is that storage is saved for necessary
data rather than spent on duplicates. Several studies [2–4]
have found that a sizeable percentage of web pages are near
duplicates. These studies suggested that approximately 1.7%
to 7% of the web pages visited by crawlers are near-duplicate
pages. Although duplicates caused by mirroring and plagiarism
can be detected simply by applying several techniques such as
machine learning and document clustering, near-duplicate
documents are more difficult to identify.
In this paper, we propose a method using a trie-tree data
structure to store a set of 64-bit strings, each of which is
Hindawi Publishing Corporation
International Journal of Distributed Sensor Networks
Volume 2014, Article ID 612970, 8 pages
http://dx.doi.org/10.1155/2014/612970