(IJCSIS) International Journal of Computer Science and Information Security, Vol. 14, No. 4, April 2016

Phishing Identification Using a Novel Non-Rule Neuro-Fuzzy Model

Luong Anh Tuan Nguyen
Faculty of Information Technology, Ho Chi Minh City University of Transport, Ho Chi Minh City, Vietnam
Email: nlatuan@hcmutrans.edu.vn

Huu Khuong Nguyen
Faculty of Information Technology, Ho Chi Minh City University of Transport, Ho Chi Minh City, Vietnam
Email: nhkhuong@hcmutrans.edu.vn

Abstract—This paper presents a novel approach to reducing the difficulty and complexity of identifying phishing sites. Neural networks and fuzzy systems can be combined so that each compensates for the other's weaknesses. We propose a new neuro-fuzzy model that identifies phishing sites without using rule sets. Specifically, the proposed technique calculates the value of each heuristic from membership functions, and the weights are then trained by a neural network. The technique is evaluated on datasets of 22,000 phishing sites and 10,000 legitimate sites. The results show that it identifies phishing sites with an accuracy above 99%.

Keywords—Phishing; Fuzzy; Neural Network; Neuro-Fuzzy

I. INTRODUCTION

According to a study by Gartner [1], 57 million US Internet users have identified the receipt of email linked to phishing scams, and about 2 million of them are estimated to have been tricked into giving away sensitive information. According to the reports of the Anti-Phishing Working Group [2], the number of phishing attacks is increasing by 5% monthly. Figure 1 shows the phishing website reports received in the first quarter of 2014, which indicate that the risk of phishing is extremely high. For these reasons, identifying phishing attacks is an urgent and important problem in modern society.
Recently, many studies have addressed phishing based on characteristics of the site, such as the URL of the website, the content of the website, a combination of the URL and the content, the source code of the website, or the interface of the website. However, each of these approaches has its own strengths and weaknesses, and no fully satisfactory method exists yet. In this paper, a new approach to identifying phishing sites is proposed that focuses on features of the URL (PrimaryDomain, SubDomain, PathDomain) and of the web traffic (PageRank, AlexaRank, AlexaReputation, GoogleIndex, BackLink). A neuro-fuzzy network is then proposed to reduce the error and increase the performance. The proposed neuro-fuzzy model uses computational models and operates without rule sets. The proposed solution achieved identification accuracy above 99% with a low rate of false signals.

The rest of this paper is organized as follows: Section II presents the related work. The system design is described in Section III. Section IV evaluates the accuracy of the method. Finally, Section V concludes the paper and outlines future work.

Fig. 1: Phishing reports received in the period January-March 2014

II. RELATED WORK

Up to now, methods for identifying phishing can be divided into three groups: blacklist, heuristic and machine learning. In the first approach, the phishing identification technique [3][4][5][6] maintains a list of phishing websites called a blacklist. The blacklist technique is inefficient because of the rapid growth in the number of phishing sites. Therefore, the heuristic and machine learning approaches have received more attention from researchers. Cantina [7] presented a TF-IDF-based algorithm using 27 features of a webpage. This technique can identify 97% of phishing sites with 6% false positives. Although the technique is effective, extracting 27 features of a webpage takes too long to meet real-time demands, and some of the features are not necessary for improving the phishing identification accuracy.
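To make the TF-IDF weighting mentioned for Cantina [7] concrete, the following is a minimal sketch of the general idea only, not the implementation used in [7]. It assumes each page has already been tokenized into a list of terms; the function names `tf_idf` and `top_terms` and the toy corpus are hypothetical and introduced purely for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumption: plain term frequency (count / document length) multiplied by
    log(N / document frequency). Other TF-IDF variants exist.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(term_weights, k=5):
    """Return the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus (hypothetical): the first "page" repeats a login-related term.
corpus = [
    ["paypal", "login", "account", "login"],
    ["news", "account", "weather"],
    ["news", "sports", "account"],
]
weights = tf_idf(corpus)
signature = top_terms(weights[0], k=2)
```

Terms that occur in every document, such as `account` in the toy corpus, receive weight zero, so the highest-weighted terms are those that are distinctive for a page.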
Similarly, Cantina+ [8] used machine learning techniques based on 15 features of a webpage, of which only six are effective for phishing identification: Bad form, Bad action fields, Non-matching URLs, Page in top search results, Search copyright brand plus domain, and Search copyright brand plus hostname. In [9], the author used the URL to identify phishing sites automatically by extracting different terms of the URL and verifying them through a search engine. Even though this paper proposed an interesting new technique, the identification rate is quite low (54.3%). The technique in [10] is a content-based approach to identifying phishing, called CANTINA, which considers the Google PageRank value of a page; however, its evaluation dataset is quite small. The characteristics of the source code are used to identify phishing sites in [11].

https://sites.google.com/site/ijcsis/ ISSN 1947-5500
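URL-based techniques such as the one in [9] start by breaking a URL into its parts. The following is a minimal sketch of such a decomposition, loosely mirroring the three URL features named in the introduction. The exact definitions of those features are not given in this section, so the split below is an assumption: in particular, treating the last two host labels as the primary domain is a naive rule that fails for suffixes such as `.co.uk`. The function name `split_url` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into sub-domain, primary domain and path parts.

    Assumption: the URL includes a scheme (e.g. "http://"); without one,
    urlsplit() cannot recognize the host and everything ends up in the path.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".") if host else []
    return {
        # Naive assumption: the last two labels form the primary domain.
        "primary_domain": ".".join(labels[-2:]),
        # Everything in front of the primary domain.
        "sub_domain": ".".join(labels[:-2]),
        # Path without leading/trailing slashes; the query string is dropped.
        "path": parts.path.strip("/"),
    }


example = split_url("http://paypal.secure-login.example.com/account/update.php?id=1")
```

In this example a well-known brand name appears only in the sub-domain, while the primary domain is unrelated to it, which is the kind of mismatch that URL-based heuristics try to detect.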