Research Article
Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection

Phuc-Tran Ho (1) and Sung-Ryul Kim (2)

(1) Department of Advanced Technology Fusion, Konkuk University, Seoul 143-701, Republic of Korea
(2) Department of Internet & Multimedia Engineering, Konkuk University, Seoul 143-701, Republic of Korea

Correspondence should be addressed to Sung-Ryul Kim; kimsr@konkuk.ac.kr

Received 4 December 2013; Revised 8 April 2014; Accepted 8 April 2014; Published 6 May 2014

Academic Editor: Hoon Ko

Copyright © 2014 P.-T. Ho and S.-R. Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Social networking is used widely by millions of people all over the world. It has become the most popular way for people to connect and interact online with their friends. Currently, there are many social networking sites, for instance, Facebook, MySpace, and Twitter, with a huge number of active users. Therefore, they are also attractive places for spammers or cheaters who want to steal the personal information of users or advertise their products. Recently, many methods using different techniques have been proposed to detect spam comments on social networks. In this paper, we propose a similarity-based method that combines a fingerprinting technique with a trie-tree data structure and a meet-in-the-middle approach in order to achieve higher accuracy in spam comment detection. Using our proposed approach, we are able to detect around 98% of the spam comments in our dataset.

1. Introduction

In the last few years, social networking has become an Internet phenomenon. It is now the main way for people to connect and keep in touch with their friends online.
The most popular social networking sites, such as Facebook, Twitter, and MySpace, are consistently among the top 20 most viewed websites on the Internet. Many people spend more and more time on their virtual lives on social networks than on their real lives. Moreover, the personal information that they store and share on such sites is usually poorly secured. Hence, social networking is also a potential target for spammers and cheaters who want to advertise their products or, more dangerously, steal the users' information. There are many simple tricks with which spammers can achieve their purposes easily, for example, posting fake updates that contain malicious links, abusing the comment function to post unsolicited messages to users, image tricks, and social engineering.

Spam comments usually have duplicate or near-duplicate contents. Therefore, they can be detected by several common methods that are used to detect duplicate and near-duplicate documents in the web mining field. Duplicate and mirror web pages are abundant on the World Wide Web [1]. Near-duplicate documents, in contrast, are mostly identical to the original ones but differ in several small portions of the document such as advertisements, timestamps, or counters.

Recently, duplicate and near-duplicate document detection has become important in various computer science fields, specifically data mining, information retrieval, and web mining. Its main benefit is that storage is spent on necessary data rather than on duplicates. Several studies have found that a sizeable percentage of web pages are near duplicates [2–4]. These studies suggested that approximately 1.7% to 7% of the web pages visited by crawlers are near-duplicate pages. Although duplicates caused by mirroring and plagiarism can be detected fairly easily with techniques such as machine learning and document clustering, near-duplicate documents are more difficult to identify.
Hindawi Publishing Corporation, International Journal of Distributed Sensor Networks, Volume 2014, Article ID 612970, 8 pages, http://dx.doi.org/10.1155/2014/612970

In this paper, we propose a method that uses a trie-tree data structure to store a set of 64-bit strings, each of which is