Automatic Classification of Cross-Site Scripting in Web Pages Using
Document-based and URL-based Features
Angelo Eduardo Nunan, Eduardo Souto, Eulanda M. dos Santos, Eduardo Feitosa
Institute of Computing (ICOMP)
Federal University of Amazonas
Av. Gal. Rodrigo Octávio Jordão Ramos, 3000
CEP 69.077-000 – Manaus-Brazil
{nunan, esouto, emsantos, efeitosa}@icomp.ufam.edu.br
Abstract— The structure of dynamic websites comprised of a set
of objects such as HTML tags, script functions, hyperlinks and
advanced features in browsers lead to numerous resources and
interactivity in the services currently provided on the Internet.
However, these features have also increased security risks and
attacks, since they allow malicious code injection, or XSS
(Cross-Site Scripting). XSS has remained at the top of the lists of
the greatest threats to web applications in recent years. This paper
presents experimental results on the automatic classification of
XSS in web pages using machine learning techniques. We focus
on features extracted from web document content and URLs. Our
results demonstrate that the proposed features lead to highly
accurate classification of malicious pages.
Keywords- cross-site scripting; scripting languages security; web
application security; machine learning.
I. INTRODUCTION
Dynamic web applications play an important role in
providing resource manipulation and interaction between
clients and servers. The features currently supported by
browsers, such as HTML tags, scripts, hyperlinks, and advanced
functions, have increased business opportunities, providing
greater interactivity in Web-based services such as e-commerce,
Internet banking, social networking, blogs, and forums.
On the other hand, such features have also increased the
vulnerabilities and threats that allow malicious exploitation.
According to Wasserman and Su [1], most web
programming languages do not provide safe data transfer
to the client by default. The absence of this procedure can
lead to one of the most frequent attacks on web applications,
Cross-Site Scripting (XSS). According to Uto and Melo
[2], XSS is an attack that exploits a vulnerable web
application, using it to carry malicious code, usually written in
JavaScript, to another user's browser. The root of this
vulnerability is the lack of validation of user input data [2, 3].
Research shows that XSS has remained at the top of the lists
of the greatest vulnerabilities in web applications in recent
years [4]. In order to deal with the large volume and range of
XSS attacks, different approaches and techniques have been
proposed in the literature, among which we highlight the use
of formal languages and automata [1], primitive markup
language elements [5, 6], blacklists and whitelists [7], and
combinations of techniques [8]. Machine learning techniques
have also been successfully used to detect web-based
anomalies [9, 10, 11].
In this paper, we identify a set of features that allows the
accurate automatic classification of XSS in web pages based
on supervised machine learning techniques. We apply two
machine learning methods, namely Naive Bayes and Support
Vector Machines, to classify web pages using a dataset
composed of 216,054 websites, of which 15,366
samples correspond to attacks that occurred from June 23,
2008 to August 2, 2011, obtained from the XSSed dataset
(http://www.xssed.com). We evaluate both classifiers in
terms of three criteria: detection rate, accuracy and false
alarm rate.
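The three evaluation criteria can be computed from a binary confusion matrix. The sketch below assumes the usual definitions (detection rate as the true-positive rate, false alarm rate as the false-positive rate); the exact formulas used in the experiments are not stated in this section, and the `ConfusionMatrix` type and the counts in the example are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # XSS pages classified as XSS
    fn: int  # XSS pages classified as benign
    fp: int  # benign pages classified as XSS
    tn: int  # benign pages classified as benign


def detection_rate(cm: ConfusionMatrix) -> float:
    """Fraction of XSS pages that were flagged (assumed: TP / (TP + FN))."""
    positives = cm.tp + cm.fn
    return cm.tp / positives if positives else 0.0


def false_alarm_rate(cm: ConfusionMatrix) -> float:
    """Fraction of benign pages wrongly flagged (assumed: FP / (FP + TN))."""
    negatives = cm.fp + cm.tn
    return cm.fp / negatives if negatives else 0.0


def accuracy(cm: ConfusionMatrix) -> float:
    """Fraction of all pages classified correctly."""
    total = cm.tp + cm.fn + cm.fp + cm.tn
    return (cm.tp + cm.tn) / total if total else 0.0


# Hypothetical counts, not results from the paper.
example = ConfusionMatrix(tp=90, fn=10, fp=5, tn=95)
example_detection = detection_rate(example)
example_false_alarm = false_alarm_rate(example)
example_accuracy = accuracy(example)
```

Because XSS samples are a small fraction of the dataset (15,366 of 216,054), accuracy alone can look high even when few attacks are caught, which is one reason to report the detection and false alarm rates alongside it.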
The remainder of this paper is organized as follows:
Section II introduces the concepts related to XSS and
discusses related work. Section III describes the
proposed features for automatic classification of XSS attacks
in Web pages. Section IV presents the experimental results
and their analysis. Finally, Section V presents conclusions
and future work.
II. UNDERSTANDING CROSS-SITE SCRIPTING (XSS)
Grossman [12] defines Cross-Site Scripting (XSS) as an
attack vector caused by malicious scripts on the client or
server, where data from user input is not properly validated.
This allows the theft of confidential information and user
sessions, and compromises the client’s browser and the
integrity of the running system. The script code used in XSS
is typically written in JavaScript and embedded in the
HTML structure [2, 12]. However, technologies such as
ActiveX, Flash or any other technology supported by
browsers can also be used as a vector [12].
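To make the role of input validation concrete, the following minimal sketch contrasts a page fragment built from raw user input with one built from escaped input. The function names and the payload are hypothetical and are not taken from the paper; `html.escape` is part of the Python standard library. Output escaping is shown here as one common mitigation, not as a complete defense.

```python
import html


def render_greeting_unsafe(name: str) -> str:
    # User input is placed into the HTML unchanged, so any markup
    # it contains becomes part of the page structure.
    return f"<p>Hello, {name}!</p>"


def render_greeting_escaped(name: str) -> str:
    # html.escape replaces &, <, > and quotes with character references,
    # so the input is displayed as text instead of being parsed as markup.
    return f"<p>Hello, {html.escape(name)}!</p>"


# Illustrative payload: a script element supplied where a name is expected.
payload = "<script>alert(1)</script>"

unsafe_page = render_greeting_unsafe(payload)
safe_page = render_greeting_escaped(payload)
```

In the unsafe variant the script element reaches the browser intact and would be executed; in the escaped variant the same input arrives as inert text.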
XSS attacks can be categorized as Persistent,
Reflective and DOM-based [3]. In the first case, the
malicious code is permanently stored on server resources.
Persistent XSS is the most dangerous type [3]. In the
second case, the code runs in the client browser without
being stored on the server. This attack is usually made
possible through links that carry the injected malicious code.
According to the OWASP (Open Web Application Security
Project) [3], this is the most frequent type of XSS attack.
Finally, instead of using malicious code embedded into the
page that is returned to the client browser, the DOM-based
XSS enables dynamic scripts on components of the
document, modifying the DOM environment (Document
978-1-4673-2713-8/12/$31.00 ©2012 IEEE 000702