The Changing Nature of Spam 2.0

Vidyasagar Potdar
Anti Spam Research Lab, School of Information Systems, Curtin University, Perth, Australia
v.potdar@curtin.edu.au

Yan Like
BTOL Technologies Inc, Chengdu, China
ylkhao_hao@163.com

Nazanin Firoozeh
Erasmus Mundus Data Mining and Knowledge Management, University of Pierre and Marie Curie, Paris, France
nazanin.firoozeh@etu.upmc.fr

Debajyoti Mukhopadhyay
Department of Information Technology, Maharashtra Institute of Technology, Pune 411038, India
debajyoti.mukhopadhyay@gmail.com

Farida Ridzuan
Anti Spam Research Lab, School of Information Systems, Curtin University, Perth, Australia
farida.mohdridzuan@postgrad.curtin.edu.au

Dhiren Tejani
Anti Spam Research Lab, Curtin University, Perth, Australia
dhiren_tejani@hotmail.com

ABSTRACT
Spam 2.0 (or Web 2.0 Spam) refers to spam content that is hosted on Web 2.0 applications (blogs, forums, social networks, etc.). Such spam differs from traditional spam in that it targets Web 2.0 applications and spreads through legitimate websites. The main problems with Spam 2.0 are that spam websites gain undeservedly high rankings in search engines, the reputation of legitimate websites is damaged, valuable computing resources are wasted, and users are deceived, resulting in the proliferation of scams, fraud and other security attacks. Protecting the Internet against Spam 2.0 attacks is becoming increasingly important because of the threats it poses to innocent web users. This paper contributes in this direction by attempting to understand the root cause of the problem through an investigation of the changing nature of Spam 2.0. To this end, we set up an online discussion forum as a honeypot to capture spam content. The collected data is analysed to identify trends within the spam corpus, including repetitiveness in the use of email addresses, patterns within email addresses, repetitiveness of forum posts, domains used for spamming, keywords and categories, and the origin of spam traffic.
In the future, we aim to use these trends to develop a preventive or early-detection system that could predict future spam activities and allow us to take pre-emptive action against them.

Categories and Subject Descriptors
K.4.4 [Electronic Commerce]: Cybercash, digital cash, distributed commercial transactions, electronic data interchange, intellectual property, payment schemes, security.

General Terms
Measurement, Documentation, Performance, Design, Economics.

Keywords
Spam Dataset, Spam tactics, Spammer Profiling, Spam Content Analysis.

1. INTRODUCTION
Web 2.0 is commonly associated with web applications that facilitate interactive information sharing, interoperability, user-centered design and collaboration on the World Wide Web. The read/write concepts of Web 2.0 are the backbone of the vast majority of web services that we use today. Web 2.0 places increasing emphasis on human collaboration, encouraging users to add value to web applications as they use them. Today, Web 2.0 functions are commonly found in web-based communities, applications, social-networking sites, media-sharing sites, wikis, blogs, mashups, and folksonomies. They are widely provided by government, public, private and personal entities.

Spam abuses these systems by sending unsolicited bulk messages indiscriminately. Its intent is to misinform users (scams), generate traffic, generate sales (marketing/advertising), and occasionally compromise parties, people or systems by spreading spyware or malware. While the most widely recognized form of spam is e-mail spam, the term is applied to similar abuses in other media. Spam 2.0 (Web 2.0 Spam) is defined as “sets of actions that are performed by automated malicious spambot agents and/or human spammers that result in bulk unsolicited and indiscriminately hosted information on web 2.0 applications” [1-3].
Web spambots (or simply spambots) can be used to crawl the web automatically and repetitively, find email addresses and/or Web 2.0 applications, and send bulk messages indiscriminately. Spambots often share a common code base and malicious intent with other malware. There are clear links between spam, scams and computer malware [4, 5, 6]. Typically, spammers, scammers and hackers collaborate to attack networks, destroy cyber infrastructure, hijack computers, spy on private or confidential data, obtain privileged information (for example, weaponry, industrial secrets, identity theft, other classified information) and spread spam.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CUBE 2012, September 3–5, 2012, Pune, Maharashtra, India.
Copyright 2012 ACM 978-1-4503-1185-4/12/09…$10.00.

Throughout the current decade, the Internet has also accumulated a significant amount of Spam 2.0 that is continually