Lecture Notes in Management Science (2015) Vol. 7, 50–55 ISSN 2008-0050 (Print), ISSN 1927-0097 (Online)
Copyright © ORLab Analytics Inc. All rights reserved.
www.orlabanalytics.ca
Fraudulent URL Classification with an RBFNN
Dirk Snyman¹, Tiny du Toit¹ and Hennie Kruger¹
¹ North-West University, Potchefstroom Campus, Potchefstroom, South Africa
{dirk.snyman, tiny.dutoit, hennie.kruger}@nwu.ac.za
Proc. ICAOR 2015
Vienna, Austria
Keywords: Neural networks, Artificial intelligence, Machine learning
Abstract
Phishing attacks are a social engineering scam that aims to steal sensitive information, such as personal, financial
and social details, from unsuspecting consumers. Such attacks are based on the premise that a significant number of
consumers are ignorant of technical security practices in a digital environment. They establish
trust with the user in order to lull them into a false sense of familiarity in which they are more likely to
freely supply their identifying information. In this study an intelligent approach to phishing detection is
presented. The method is based on the information contained in the Uniform Resource Locator (URL) that
points to a phishing website. These URLs are analysed and classified as malicious or safe by a new automated
Radial Basis Function Neural Network (RBFNN) construction algorithm. This technique determines the
best neural network architecture for a specific classification task and uses an in-sample model selection
criterion. Example URLs collected from real-world repositories, Open Directory and
Phishtank, are used to train and evaluate the neural network. The results of a cross-validation experiment
are presented. It was found that the RBFNN algorithm outperforms a naïve Bayes classifier,
which is considered a standard text classification baseline in the literature due to its simple structure and
linear execution.
Introduction
In a range of publications on information security analysis over an extended period, Richardson (2010; 2008) and Berger (2012)
state that viruses and malware account for the most commonly seen type of malicious intrusion in the corporate
environment. Of the surveyed users, 49% reported incidents leading to financial losses and 67.1% reported information
losses. These reported incidents extend beyond the corporate environment and also include personal losses. This poses a
great risk to companies and individuals in the digital domain. Malicious human actions (wilful information security
breaches or exploits from within the company) account for a much smaller fraction of the reported security
violations. Richardson (2010; 2008) and Berger (2012) report that only 3% of the losses are attributed to malicious
intent by insiders. Losses due to human negligence or error, without any intent to breach information security, are
reported at 14.5%, a larger share than that attributed to malicious intent and arguably one of the most preventable.
One type of malicious endeavour that relies on and specifically exploits human error and ignorance is phishing
(APWG, 2014; Jacobson, 2007; Dhamija et al., 2006). The Anti-Phishing Working Group (2014) describes phishing as a
malicious endeavour that aims to steal sensitive information, such as personal, financial and social details, from unsuspecting
consumers by employing social engineering approaches in a technical environment. These approaches are usually based
on fraudulent electronic messages sent to consumers and presented in such a manner that they seem to
originate from a trusted supplier, institution, or other service with which the consumer has an affiliation. These messages
usually contain an appeal to the consumer to follow a link to a website that mimics that of the trusted institution.
Consumers are then coerced into supplying personal and financial information to the fraudulent website; this information is subsequently
used unlawfully for a range of criminal activities, including identity theft and financial exploitation.
According to Dhamija et al. (2006), traditional indicators that could expose such websites as
fraudulent are failing, either because users are inexperienced or ignorant of these indicators, or because knowledgeable
users are so convinced by the visual similarity to the site they know and trust that they neglect to confirm the site’s
validity. Human error can be mitigated through the implementation and strict application of policies and
guidelines that govern human interaction with sensitive information. Policies, however, are only effective as long as
their implementation is actively monitored and enforced, and even then these policies do not guarantee safety but merely
limit the scope for human error that could lead to information losses (Cranor, 2008).
In an attempt to limit the human factor in the detection of a phishing website, this paper presents a machine learning
approach to identify whether a site would be safe to visit, or should be avoided, based on information contained in the