International Journal of Engineering Trends and Technology (IJETT) – Volume 30 Number 9 - December 2015 ISSN: 2231-5381 http://www.ijettjournal.org Page 444

Survey on Spam Filtering Techniques and Mapreduce

Prajakta S. Patil #1, Prof. Rashmi A. Rane #2, Prof. Madhuri A. Bhalekar #3
Department of Computer Engineering, Maharashtra Institute of Technology, Pune, India

Abstract—Spam email, also known as junk email, is a subset of electronic spam involving nearly identical messages sent to numerous recipients by email. The messages may contain disguised links that appear to be for familiar websites but in fact lead to phishing web sites or sites that are hosting malware. Spam email may also include malware as scripts or other executable file attachments. Spam is any unwanted and harmful mail, so separating it from normal mail is essential. This paper surveys different spam email filtering techniques. The techniques fall into four groups: machine learning based, list based, content based, and hybrid or other. Machine learning based filtering is the most widely used because of its high accuracy and mathematical foundation.

Keywords—Spam filtering techniques, machine learning based, content based, word based.

I. INTRODUCTION

The email system is one of the most widely used modern communication tools, and its wide availability is a boon for business. Email is quick, since the sender does not have to wait for a response, and it is a straightforward way to stay in touch with everyone. One threat to the email system is spam. Spam is unwanted mail, usually defined as mail sent in bulk. Spam email, also known as junk email, is addressed to a large number of recipients. Spam messages normally contain links to phishing web sites or malware hosting web sites. Spam email may also include malware as scripts or other executable file attachments. Besides these risks, checking the legitimacy of mail consumes valuable time.
According to the SMX email security provider, the live spam percentage is about 79.5% [1]. The average size of a spam message is 16 KB. Spam filtering is therefore important for separating such spam from important mail. Amongst the available filtering techniques, Naïve Bayesian classification, Support Vector Machine, and K-Nearest Neighbor are the most used and appreciated by researchers. A number of free and paid spam filtering tools that make use of these techniques are also available.

A machine learning technique such as Support Vector Machines (SVM) can be applied efficiently in spam filtering. SVM training is a compute-intensive process, so there is a lot of scope for introducing the MapReduce platform into spam filter training. The MapReduce paradigm works with Map and Reduce tasks, and these tasks are independent of one another. During SVM training, the goal is to find the hyperplane that maximizes the margin between the data points of the two classes in hyperspace.

Data mining is the process of discovering new patterns from large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. This problem has been studied by many scholars across many application areas for years, and many data mining methods have been developed and applied in practice. However, most classical data mining methods become impractical in the face of big data. Computation- and data-intensive scientific data analyses have become increasingly prevalent in recent years. Support Vector Machines (SVMs) are powerful classification and regression tools, but their compute and storage requirements increase rapidly with the number of training vectors, putting many problems of practical interest out of their reach. Efficient parallel algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such large scale data mining analyses.

II.
OVERVIEW OF EMAIL SYSTEM

This section briefly explains the email protocol and the filtering process. Simple Mail Transfer Protocol (SMTP) is the first protocol involved; it transfers emails by means of a set of commands. The figure illustrates the SMTP commands. First, a TCP/IP (Transmission Control Protocol and Internet Protocol) connection is opened between the sender and the associated mail server. Following that, the SMTP commands begin with a Hello message that announces the acceptance of the session between the client and the server. This process ends when the message is accepted by the mail server. The TCP connection is closed if there are no more messages from the client to the mail server.

When the email is delivered by the server, the filtering phase starts. Based on the server filtering policy, blacklist and whitelist filtering is started to examine whether the email is spam or valid. If the email is recognized as valid, it is sent to the receiver's inbox; otherwise it is blocked or transferred to the spam folder. When greylist filtering is used on the relevant mail server, the email is rejected on the first delivery attempt. Afterward the body of the email is tested with content-based and rule-based filters according to the standards of the administrator
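
The filtering phase described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the policy of any particular mail server: the names `Email`, `MailFilter` and `classify`, the example addresses, and the phrase list standing in for content-based and rule-based filters are all hypothetical. It also assumes that a whitelisted sender bypasses the later stages, and that the greylist remembers (sender, recipient) pairs so that a retry is let through.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    recipient: str
    body: str


@dataclass
class MailFilter:
    # All lists and phrases here are hypothetical example data.
    whitelist: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)
    spam_phrases: tuple = ("free money", "click here")
    use_greylist: bool = True
    # (sender, recipient) pairs that have already attempted delivery once.
    seen: set = field(default_factory=set, init=False)

    def classify(self, email: Email) -> str:
        # Stage 1: list-based filtering on the sender address.
        if email.sender in self.blacklist:
            return "spam"
        if email.sender in self.whitelist:
            return "inbox"

        # Stage 2: greylisting rejects the first attempt from an unknown pair.
        if self.use_greylist:
            key = (email.sender, email.recipient)
            if key not in self.seen:
                self.seen.add(key)
                return "retry-later"

        # Stage 3: a toy content/rule-based check on the message body.
        text = email.body.lower()
        if any(phrase in text for phrase in self.spam_phrases):
            return "spam"
        return "inbox"


if __name__ == "__main__":
    mail_filter = MailFilter(
        whitelist={"boss@example.com"},
        blacklist={"bulk@spam.example"},
    )
    message = Email("new@example.org", "me@example.com", "Click here for free money")
    print(mail_filter.classify(message))  # first attempt is greylisted
    print(mail_filter.classify(message))  # the retry reaches the content stage
```

In a real deployment the phrase check would be replaced by one of the content-based or machine learning based filters surveyed in this paper, and the greylist would usually also track the sending host and a retry time window.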