International Journal of Applied Information Systems (IJAIS) – ISSN: 2249-0868
Foundation of Computer Science FCS, New York, USA
Volume 12 – No. 7, October 2017 – www.ijais.org

Phishing Detection in E-mails using Machine Learning

Srishti Rawal, VIT University, Chennai, India
Bhuvan Rawal, BITS Pilani, Goa, India
Aakhila Shaheen, BITS Pilani, Hyderabad, India
Shubham Malik, VIT University, Chennai, India

ABSTRACT
Emails are widely used as a means of communication for personal and professional purposes. The information exchanged over email is often sensitive and confidential, such as banking information, credit reports and login details. This makes emails valuable to cyber criminals, who can use the information for malicious purposes. Phishing is a strategy used by fraudsters to obtain sensitive information from people by pretending to be a recognized source. In a phished email, the sender tries to convince the recipient to provide personal information under false pretenses. This work treats the detection of a phished email as a classification problem and describes the use of machine learning algorithms to classify emails as phished or ham. A maximum accuracy of 99.87% is achieved in the classification of emails using the SVM and Random Forest classifiers.

General Terms
Phishing, security, classification

Keywords
Phishing detection, SVM, ham, naive bayes, machine learning, email fraud, artificial intelligence

1. INTRODUCTION
Phishing is a lucrative type of fraud in which the criminal deceives recipients and obtains confidential information from them under false pretenses. Phished emails may direct users to click on a link to a website or an attachment where they are required to provide confidential information such as passwords or credit card details. The phisher sends the messages to thousands of users, and although usually only a small percentage of recipients fall into the trap, this can still result in high profits for the sender.

In 2006, hackers in America used emails as “baits” to steal the usernames and passwords of America Online accounts. Since then, phishing techniques have evolved, making fraudulent emails harder to identify. According to the 2016 data breach report by Verizon, roughly 636,000 phishing emails were sent, and only 3% of the targeted individuals alerted management to a possible phishing email. A massive phishing attack targeting millions of Gmail users hit Google in May 2017, in which the hacker gained access to the email histories of users. Using this information, the hackers were able to send emails that appeared to come from a known source and asked recipients to check an attached file. On clicking the link to the attached file, users were asked to give a fake app permission to manage their email accounts.

With the ever-increasing use of email and the growth of technology, the risk of losing valuable information to fraudsters has also been increasing. This paper focuses on identifying phished emails with the help of machine learning algorithms. In the proposed system, detecting a phished email is described as a classification problem with two categories, i.e. ham and phished. Machine learning is a field of artificial intelligence in which the system is given the ability to learn without being explicitly programmed. In our model, supervised machine learning algorithms are used for classification. Supervised learning algorithms predict the nature of unknown data based on known examples. These algorithms are a subset of machine learning algorithms which iteratively learn from data.

The remainder of the paper is organized as follows. Section 2 discusses the existing systems used for detection of phishing in emails. Section 3 describes the proposed system and the algorithms used, and provides a brief description of the features used. Section 4 explains the results obtained.
Section 5 draws a conclusion and is followed by the references.

2. RELATED WORK
Andronicus et al. used a random forest classifier to classify phished emails. They aimed to maximize accuracy while minimizing the number of features required for classification, and presented a content-based phishing detection approach with high accuracy.

In [2], the authors proposed a model based on features extracted from the header and HTML body of an email, which are classified using a feed-forward neural network. The results indicate a classification accuracy of 98.72%.

In [3], a dataset of over 7000 emails and a number of different features are used, and an overall accuracy of 99.5% is achieved.

Gilchan Park et al. aimed to extract robust features in order to discriminate between legitimate and phished emails. They compare sentence syntactic similarity and the differences in subjects and objects of target verbs between phishing emails and legitimate emails.

In “Email Phishing : An open threat to everyone”, the different techniques of phishing are analyzed and suggestions are provided to help users avoid falling into the trap of fraudsters.

C. Emilin Shyni et al. describe a methodology incorporating natural language processing, machine learning and image processing. They use a total of 61 features and achieve a classification accuracy of above 96% using a multi-classifier.

In “Detection Phishing Emails Using Features Decisive values”, 18 features are extracted and the proposed algorithm classifies each email depending upon the existence of flags and the weightage of features. Their results show that, of the 18 features extracted, high accuracy can be obtained if the most effective features are used for classification.

In “Phish-IDetector”, the authors focus on the properties of Message-IDs and apply n-gram analysis to the Message-IDs.
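
To make the idea of n-gram analysis over Message-IDs concrete, the following is a minimal sketch in Python. It is not the implementation from the cited work: the function name `char_ngrams`, the choice of character-level trigrams, and the sample Message-ID are all assumptions made purely for illustration.

```python
from collections import Counter


def char_ngrams(message_id: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a Message-ID string.

    Hypothetical helper for illustration; not taken from the cited paper.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Strip surrounding whitespace and the angle brackets that
    # usually wrap a Message-ID header value.
    text = message_id.strip().strip("<>")
    # Slide a window of width n over the string and count each substring.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Made-up Message-ID, used only as sample input.
sample = "<abc123@mail.example.com>"
features = char_ngrams(sample, n=3)
print(features.most_common(5))
```

The resulting counts could then serve as input features for a classifier; how the cited authors weight or select the n-grams is not specified here.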