Automatic Classification of Cross-Site Scripting in Web Pages Using Document-based and URL-based Features

Angelo Eduardo Nunan, Eduardo Souto, Eulanda M. dos Santos, Eduardo Feitosa
Institute of Computing (ICOMP)
Federal University of Amazonas
Av. Gal. Rodrigo Octávio Jordão Ramos, 3000
CEP 69.077-000 – Manaus-Brazil
{nunan, esouto, emsantos, efeitosa}@icomp.ufam.edu.br

Abstract- The structure of dynamic websites, comprising a set of objects such as HTML tags, script functions, hyperlinks, and advanced browser features, provides numerous resources and a high degree of interactivity in the services currently offered on the Internet. However, these features have also increased security risks and attacks, since they allow malicious code injection, or XSS (Cross-Site Scripting). XSS has remained at the top of the lists of the greatest threats to web applications in recent years. This paper presents experimental results on the automatic classification of XSS in web pages using machine learning techniques. We focus on features extracted from web document content and URLs. Our results demonstrate that the proposed features lead to highly accurate classification of malicious pages.

Keywords- cross-site scripting; scripting languages security; web application security; machine learning.

I. INTRODUCTION

Dynamic web applications play an important role in providing resource manipulation and interaction between clients and servers. The features currently supported by browsers, such as HTML tags, scripts, hyperlinks, and advanced functions, have increased business opportunities by providing greater interactivity in Web-based services such as e-commerce, Internet banking, social networking, blogs, and forums. On the other hand, such features have also increased the vulnerabilities and threats that allow malicious exploitation.

According to Wasserman and Su [1], most web programming languages do not provide, by default, safe data transfer to the client.
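To make the point about unsafe data transfer concrete, the following is a minimal Python sketch, not taken from the paper. The helper names `render_comment_unsafe` and `render_comment_escaped` and the sample payload are hypothetical. It assumes a page that echoes user input into HTML, and contrasts plain string concatenation with explicit encoding through the standard library function `html.escape`.

```python
import html


def render_comment_unsafe(comment: str) -> str:
    # Hypothetical helper: user input is concatenated into the page as-is,
    # so any markup it contains reaches the browser unchanged.
    return "<p>" + comment + "</p>"


def render_comment_escaped(comment: str) -> str:
    # Same hypothetical helper, but HTML metacharacters (<, >, &, quotes)
    # are encoded before the data is sent to the client.
    return "<p>" + html.escape(comment) + "</p>"


# Illustrative payload only; real attacks are far more varied.
payload = "<script>alert(1)</script>"

unsafe = render_comment_unsafe(payload)
escaped = render_comment_escaped(payload)

print(unsafe)
print(escaped)
```

In the first variant the script element arrives intact and would be executed by the browser. In the second it arrives as inert text, because the encoding step was applied explicitly rather than by default.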
The absence of this procedure can lead to one of the most frequent attacks on web applications, Cross-Site Scripting (XSS). According to Uto and Melo [2], XSS is an attack that exploits a vulnerable web application, which is used to carry malicious code, usually written in JavaScript, to another user's browser. The root of this vulnerability is the lack of validation of user input data [2, 3]. Research shows that XSS has remained at the top of the lists of the greatest vulnerabilities in web applications in recent years [4].

To deal with the large volume and range of XSS attacks, different approaches and techniques have been proposed in the literature, among which we highlight the use of formal languages and automata [1], primitive markup language elements [5, 6], blacklists and whitelists [7], and combinations of techniques [8]. However, machine learning techniques have also been successfully used to detect web-based anomalies [9, 10, 11].

In this paper, we identify a set of features that allows the accurate automatic classification of XSS in web pages based on supervised machine learning techniques. We apply two machine learning methods, namely Naive Bayes and Support Vector Machines, to classify web pages using a dataset composed of 216,054 websites, of which 15,366 samples correspond to attacks that occurred from June 23, 2008 to August 2, 2011, obtained from the XSSed dataset (http://www.xssed.com). We evaluate both classifiers in terms of three criteria: detection rate, accuracy rate, and false alarm rate.

The remainder of this paper is organized as follows: Section II introduces the concepts related to XSS and discusses some related works. Section III describes the proposed features for automatic classification of XSS attacks in Web pages. Section IV presents the experimental results and their analysis. Finally, Section V presents conclusions and future work.

II.
UNDERSTANDING CROSS-SITE SCRIPTING (XSS)

Grossman [12] defines Cross-Site Scripting (XSS) as an attack vector caused by malicious scripts on the client or server, where data from user input is not properly validated. This allows the theft of confidential information and user sessions, and it can compromise the client's browser and the integrity of the running system. The script code used in XSS is typically written in JavaScript and embedded in the HTML structure [2, 12]. However, technologies such as ActiveX, Flash, or any other technology supported by browsers can also be used as a vector [12].

XSS attacks can be categorized as Persistent, Reflective, and DOM-based [3]. In the first case, the malicious code is permanently stored on server resources. Persistent XSS is the most dangerous type [3]. In the second case, the code runs in the client browser without being stored on the server. This attack is usually made possible through links that inject malicious code. According to the OWASP (Open Web Application Security Project) [3], this is the most frequent type of XSS attack. Finally, instead of using malicious code embedded in the page that is returned to the client browser, DOM-based XSS enables dynamic scripts on components of the document, modifying the DOM environment (Document 978-1-4673-2713-8/12/$31.00 ©2012 IEEE 000702