Spam Campaign Cluster Detection Using Redirected URLs and Randomized Sub-Domains

Abu Awal Md Shoeb, Dibya Mukhopadhyay, Shahid Al Noor, Alan Sprague, Gary Warner
Department of Computer and Information Sciences
University of Alabama at Birmingham, AL, USA
{shoeb, dibya, shaahid, sprague, gar}@uab.edu

Abstract

A substantial majority of the email sent every day is spam. Spam emails can cause many problems when a recipient acts on them or clicks a link provided in the email body. These problems include infecting the user's personal machine with malware, stealing personal information, and capturing credit card information. Because spam emails are generated as part of a very limited number of spam campaigns, it is useful to cluster spam messages into campaigns so as to identify which campaigns are the largest. This enables investigators to focus their attention on the largest, and therefore most significant, clusters. In this paper, we present a method to cluster spam emails into spam campaigns. In our approach, the redirected URL is chosen as the primary field for cluster formation. Our study shows that a huge number of URLs arriving in spam email eventually point to a much smaller set of redirected URLs. Our multilevel clustering method grouped 90% of our half million spam emails into 4 spam campaigns. In addition to redirected URLs, we also use randomized sub-domains, which appear in the given URLs in the email body, for campaign identification. We believe that our model can be applied in real time to quickly detect major campaigns.

1 Introduction

Spam email identification, also called filtering, is an essential concern of numerous internet security companies. According to the Kaspersky Security Bulletin in 2013, about 70% of all email sent today is spam [16]. A spam message originates from a spam campaign. A campaign is a collection of messages that are generated from a single message template. The two primary methods of filtering spam emails are content-based and blacklist-based [2].
The content-based approach considers several factors during spam detection, such as the number of words in the page title and body along with their average length, the fraction of visible content, the fraction of globally popular words, compressibility, and n-gram likelihoods [3]. In the blacklist-based approach, on the other hand, well-known spamming hosts are detected, blacklisted, and blocked [5]. Several researchers have used the IP addresses of botnets to detect spam emails [4, 6, 7]. Although checking against blacklisted IPs is a simple approach that requires only limited computing resources, compiling and maintaining such a list is challenging. Attackers often change the host IP address, or an already compromised host is patched [8]. In contrast, some researchers have proposed a whitelist approach, which maintains a list of trusted IPs and treats mail from any other IP as spam [9]. However, identifying whitelisted IPs is not easy, and a legitimate email is often categorized as blacklisted because of the large number of emails involved [10]. Challenge-response is another popular method for detecting spam, in which the sender has to prove its authenticity by replying to a challenge sent by the recipient [11]. However, when both parties implement this approach, the system can deadlock [18].

Spam campaign identification is a challenging problem for several reasons. First, spam has to be identified on the fly, because the attributes of a spam campaign (email subject, URL, IP addresses, message body, etc.) change very frequently. Second, it is hard to capture the pattern or template of the spam without a large collection of spam data. Finally, analyzing the different parameters of a spam campaign is time consuming and requires substantial computing resources.

The University of Alabama at Birmingham maintains the UAB Spam Data Mine.
This data mine receives one million spam emails daily and contains nearly one billion spam emails in total [20]. The process of mining spam data is time consuming: it requires going through every email to mine and cluster the emails based on their multiple attributes. A spam email has many attributes in common with a regular email. However, it contains some additional properties that help to identify it properly. In order to detect spam mail, we consider the given URLs (the URLs embedded within the spam emails) along with the email subject. Proper selection of an attribute is a key part of real-time campaign identification.

We process more than half a million spam emails based on their URLs and subjects. We find all redirected URLs by processing the given URLs. We also utilize the sub-domains of the given URLs by splitting them into several parts. The subject of the email is used as a secondary attribute. As a result, our approach minimizes the clustering time. Because we have chosen the redirected URL as our primary clustering attribute, a major campaign containing URLs can be identified within a very short period of time. Because a large campaign can affect a larger number of users, a large campaign can be far more harmful compared to