International Journal of Computer Applications (0975 – 8887)
Volume 78 – No.10, September 2013

Spammer Detection by Extracting Message Parameters from Spam Emails

Acquin Dmello, Student, St. Francis Institute of Technology, Mumbai, India
Gaurang Mhatre, Student, Sardar Patel Institute of Technology, Mumbai, India
Rohan Lopes, Student, St. Francis Institute of Technology, Mumbai, India
Haince Pen, Alumni, St. Francis Institute of Technology, Mumbai, India

ABSTRACT
Traditional and current methods of detecting spam emails work quite well, but they take no measures to detect and block the malicious actions of the spammers themselves. In this paper, a combination of email parameters is used to cluster legitimate emails and spam emails. Initially, the approach tries to cluster spam emails. Based on their sources, the spam emails are clustered using their message subjects, attachments, number of hyperlinks, message length, and stylistic and semantic parameters. Since emails from the same source share certain similarities, they are clustered together. These clusters are then mapped to their respective domains, and the corresponding IP addresses are retrieved and reported to anti-spam agencies.

General Terms
Algorithm, Clustering, Feature Extraction, Spammer Detection, Security, WEKA.

Keywords
Detection, Email Parameters, Information Extraction, Spam

1. INTRODUCTION
Spam-related cyber-crimes are among the fastest-growing threats to society. Spamming generates illegal earnings through the sale of various products and also spreads malware, which serves as a medium to steal confidential data from a user's computer or to impair its functioning. Present methods of combating spam have been largely ineffective because they serve only as a temporary means of blocking its effects. The best way to prevent spam emails completely is to stop spam at its source, that is, to detect the spammer.
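As an illustration of the message parameters named in the abstract, the following minimal sketch extracts the subject, attachment count, hyperlink count and message length from a raw email using Python's standard `email` module. The function name `extract_parameters`, the URL pattern and the output keys are assumptions made for this example and are not taken from the paper; the stylistic and semantic parameters are not covered here.

```python
import re
from email import message_from_string

# Simple pattern for http/https links; an assumption for this sketch,
# not a complete URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_parameters(raw_email: str) -> dict:
    """Return a few clustering parameters for one raw RFC 5322 message."""
    msg = message_from_string(raw_email)
    body_parts = []
    attachments = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_filename():
            attachments += 1  # a part with a filename is treated as an attachment
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    body = "\n".join(body_parts)
    return {
        "subject": msg.get("Subject", ""),
        "attachments": attachments,
        "hyperlinks": len(URL_PATTERN.findall(body)),
        "length": len(body),
    }
```

Each dictionary returned this way could serve as one feature vector for a clustering step; how the features are weighted or compared is left open here.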
At present, spammers use three main techniques to send spam emails [1]: open relays and proxies, botnets, and short-lived BGP-announced routes. Open relays are mail servers that allow any Internet host to connect and send emails through them. Botnets are collections of machines that communicate with one another to perform a common task under one centralized controller. The first two techniques are no longer very powerful, as current anti-spam technologies are able to mitigate their effect. The most complex technique is the short-lived BGP-announced routes technique, in which a spammer announces an IP address space, sends spam emails from it, and the IP space then vanishes after some time [1], [2]. In this way, spammers manage to remain hidden.

This research proposes a combined approach to detecting the source of spam and reporting it to anti-spam agencies. To block spammers from sending spam emails, their supporting infrastructure should be dismantled. Hindering the functionality of spam hosts will greatly reduce spammers' revenue from illegal email campaigns and obstruct their ability to commit spam-related cyber-crimes.

This research uses a combination of algorithms to cluster spam email domains based on their hosting IP addresses and other email parameters. This combination of algorithms detects potential spammer sources over a certain period of time. The experimental results show that, when domain names are examined, many seemingly unrelated spam emails are actually related. By using wildcard DNS records and constantly replacing old domains with new ones, spammers can efficiently evade URL- or domain-based blacklisting. Spammers also change their IP addresses occasionally, but not as frequently as their domains.
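The idea of relating spam domains through their hosting IP addresses can be sketched as follows. This is only an illustrative grouping, assuming that two domains belong to the same cluster whenever they share at least one hosting IP address, directly or through a chain of other domains. The function name `cluster_domains_by_ip`, the union-find grouping and the sample data are assumptions for this example, not the algorithm used in the paper.

```python
from collections import defaultdict


def cluster_domains_by_ip(domain_to_ips):
    """Group domains that share hosting IPs, directly or transitively."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_domain_for_ip = {}
    for domain, ips in domain_to_ips.items():
        find(domain)  # register domains that have no shared IP as well
        for ip in ips:
            if ip in first_domain_for_ip:
                union(first_domain_for_ip[ip], domain)
            else:
                first_domain_for_ip[ip] = domain

    clusters = defaultdict(set)
    for domain in domain_to_ips:
        clusters[find(domain)].add(domain)
    # Largest clusters first, members in alphabetical order.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical observations: domain -> hosting IPs seen over some period.
observed = {
    "cheap-pills.example": {"192.0.2.10"},
    "best-pills.example": {"192.0.2.10", "192.0.2.11"},
    "pill-shop.example": {"192.0.2.11"},
    "win-prize.example": {"198.51.100.7"},
}
clusters = cluster_domains_by_ip(observed)
```

In this sample, the three "pills" domains fall into one cluster because they are linked through shared IP addresses, even though the first and third never share an address directly. This mirrors the observation that seemingly unrelated spam domains can turn out to be related.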
The domains and IP addresses identified using the method proposed in this paper can be forwarded to cyber-crime investigators, who can trace the identities of the spammers and shut down the related spamming infrastructure. This paper illustrates how data mining and clustering techniques can help to detect spam domains and their hosts for anti-spam forensic purposes.

2. RELATED WORK
Today, spam researchers are interested in identifying and obstructing the sources of spam emails, not just in identifying the spam emails themselves. Spam can be stopped more effectively by disrupting its sources, such as the command-and-control (C&C) and hosting servers, through legal action. This paper adopts the same concept. The goal of this research is to cluster spam emails and identify spamming infrastructure that belongs to the same spamming group. In this section, the related research is reviewed, including anti-spam and clustering algorithms on data streams. Spammer detection ability can be increased by considering more parameters for clustering.

According to Halder et al. [3], spam emails share some identical styles and semantics. They proposed that spam emails can be identified using a stylistic and semantic approach, and that the spammer can then be identified with the help of feature extraction using data mining. Li et al. [4] claimed that spam emails are generally sent in groups that share certain similarities with respect to their domain, the URLs present within them, or their prototype. Hence, their research indicated that different spam campaigns around the world can be attributed to a small group of spammers. Chun et al. [5] also examined similar tendencies in spammer behavior and inferred that clustering emails based on their subjects and IP addresses can be an effective strategy for determining the spammer source. They did