Lecture Notes in Management Science (2015) Vol. 7, 50-55
ISSN 2008-0050 (Print), ISSN 1927-0097 (Online)
Copyright © ORLab Analytics Inc. All rights reserved. www.orlabanalytics.ca

Fraudulent URL Classification with an RBFNN

Dirk Snyman 1, Tiny du Toit 1 and Hennie Kruger 1
1 North-West University, Potchefstroom Campus, Potchefstroom, South Africa
{dirk.snyman, tiny.dutoit, hennie.kruger}@nwu.ac.za

Proc. ICAOR 2015 Vienna, Austria

Abstract
Phishing attacks are social engineering scams that aim to steal sensitive information, such as personal, financial and social details, from unsuspecting consumers. They are based on the premise that a significant number of consumers are ignorant of technical security practices in a digital environment. These attacks establish trust with the user in order to lull them into a false sense of familiarity in which they are more likely to freely supply their identifying information. In this study an intelligent approach to phishing detection is presented. The method is based on the information contained in the Uniform Resource Locator (URL) that points to a phishing website. These URLs are analysed and classified as malicious or safe by a new automated Radial Basis Function Neural Network (RBFNN) construction algorithm. This technique determines the best neural network architecture for a specific classification task and uses an in-sample model selection criterion. Example URLs collected from real-world repositories, Open Directory and Phishtank, are used to train and evaluate the neural network. The results of a cross-validation experiment are presented. It was found that the RBFNN algorithm outperforms a naïve Bayes classifier, which is considered a standard text classification baseline in the literature due to its simple structure and linear execution.

Keywords: Neural networks, Artificial intelligence, Machine learning

Introduction

In a range of information security analyses published over an extended period, Richardson (2010; 2008) and Berger (2012) state that viruses and malware are the most commonly seen type of malicious intrusion in the corporate environment. Of the surveyed users, 49% reported incidents leading to financial losses and 67.1% reported information losses. These reported incidents extend beyond the corporate environment and also include personal losses. This poses a great risk to companies and individuals in the digital domain.

Human malicious actions (in terms of wilful information security breaches or exploits from within the company) account for a much smaller fraction of the reported security violations. Richardson (2010; 2008) and Berger (2012) report that only 3% of the losses are attributed to malicious intent by insiders. The losses due to human negligence or error, without any intent to breach information security, are reported to be 14.5%; a larger share than malicious intent and arguably one of the most preventable issues.

One type of malicious endeavour that relies on and specifically exploits human error and ignorance is phishing (APWG, 2014; Jacobson, 2007; Dhamija et al., 2006). The Anti-Phishing Working Group (2014) describes phishing as a malicious endeavour that aims to steal sensitive information, such as personal, financial and social details, from unsuspecting consumers by employing social engineering approaches in a technical environment. These approaches are usually based on fraudulent electronic messages sent to consumers and presented in such a manner that they seem to originate from a trusted supplier, institution, or other service with which the consumer has an affiliation. These messages usually contain an appeal to the consumer to follow a link to a website that mimics that of the specific trusted institution.
They are then coerced into supplying personal and financial information to the fraudulent website, which is subsequently used unlawfully for a range of criminal activities, including identity theft and financial exploitation. According to Dhamija et al. (2006), traditional indicators that could expose such websites as fraudulent are failing, either because users are inexperienced or ignorant about these indicators, or because knowledgeable users are so convinced by the visual similarity to the site they know and trust that they neglect to confirm the site's validity.

Human error can be mitigated through the implementation and strict application of policies and guidelines that govern human interaction with sensitive information. Policies, however, are only effective as long as their implementation is actively monitored and enforced, and even then these policies do not guarantee safety but merely limit the possible human error that could lead to information losses (Cranor, 2008).

In an attempt to limit the human factor in the detection of a phishing website, this paper presents a machine learning approach to identify whether a site would be safe to visit, or should be avoided, based on information contained in the