Unsupervised learning techniques for an intrusion detection system

Stefano Zanero
zanero@elet.polimi.it
Sergio M. Savaresi
savaresi@elet.polimi.it
Dipartimento di Elettronica e Informazione, Politecnico di Milano
Piazza L. da Vinci, 32; 20133, Milan, Italy

ABSTRACT
With the continuous evolution of the types of attacks against computer networks, traditional intrusion detection systems, based on pattern matching and static signatures, are increasingly limited by their need for an up-to-date and comprehensive knowledge base. Data mining techniques have been successfully applied in host-based intrusion detection. Applying data mining techniques to raw network data, however, is made difficult by the sheer size of the input; this is usually avoided by discarding the network packet contents. In this paper, we introduce a two-tier architecture to overcome this problem: the first tier is an unsupervised clustering algorithm which reduces the network packet payload to a tractable size. The second tier is a traditional anomaly detection algorithm, whose efficiency is improved by the availability of data on the packet payload content.

Categories and Subject Descriptors
K.6.5 [Security and Protection]: Unauthorized access (e.g., hacking, phreaking); I.5.3 [Clustering]: Algorithms; C.2.3 [Network Operations]: Network monitoring

General Terms
Security, Experimentation.

Keywords
Intrusion detection, anomaly detection, unsupervised clustering, quality of clusters, K-means, principal direction divisive partitioning, self-organizing maps.

1. INTRODUCTION AND MOTIVATIONS
One of the most excruciating pains in both the intrusion and virus detection fields is the constant need for up-to-date definitions of the attacks.
This follows from the use of a “misuse detection” approach, which tries to define what is anomalous instead of defining what is normal. While this kind of approach has been widely successful and is implemented in almost all modern antivirus and intrusion detection tools, its main drawback is that, when facing an unknown attack, misuse-based systems are essentially useless. In the antivirus world this problem has been more or less successfully approached with round-the-clock response teams and signature distribution methodologies. In the intrusion detection world, keeping such a knowledge base up to date is essentially a lost battle.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’04 March 14-17 2004, Nicosia, Cyprus. Copyright 2004 ACM 1-58113-812-1/03/04 ...$5.00.

The problem does not lie only in the sheer number of vulnerabilities that are discovered every day: there is also an unknown number of unexposed vulnerabilities that may not be immediately available to the experts for analysis and inclusion in the knowledge base (which, in general, does not happen for viral code). In addition, some forms of attack could even be devised on the spot by a particularly skilled attacker, just to hit a single system or a few systems (again, this is not what you would expect from a virus). In fact, misuse-based IDS are particularly effective against the so-called “script kiddies”, unskilled attackers who rely on commonly known attack tools, for which a signature is usually widely available.
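To make this limitation concrete, the following is a minimal sketch of a misuse-based matcher, not the implementation of any real IDS. The signature names, byte patterns, sample payload, and the single-byte XOR encoding are all hypothetical and chosen only for illustration. It shows that a payload is flagged only when it contains a pattern already present in the knowledge base.

```python
# Hypothetical signature knowledge base: name -> byte pattern (illustrative only).
SIGNATURES = {
    "shell-string": b"/bin/sh",
    "dir-traversal": b"../..",
}


def match_signatures(payload: bytes) -> list:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


def xor_encode(data: bytes, key: int) -> bytes:
    """Toy single-byte XOR encoding, standing in for any payload obfuscation."""
    return bytes(b ^ key for b in data)


# A made-up request that contains a known pattern.
known = b"GET /cgi-bin/x?cmd=/bin/sh HTTP/1.0"
# The same bytes after a trivial re-encoding that the knowledge base has never seen.
variant = xor_encode(known, 0x5A)

print(match_signatures(known))    # ['shell-string']
print(match_signatures(variant))  # []
```

The second call returns an empty list even though the underlying content is the same, which is the sense in which a purely signature-based system is blind to anything outside its knowledge base.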
Additionally, computer attacks are usually polymorphic, since there are different ways to exploit the same vulnerability. Thus, it is correspondingly more difficult to develop appropriate signatures: either we generate a number of signatures to cover each possible variation of the attack, or we try to generalize the signatures, at the risk of generating false positives. In some cases this is inherent to the attacks, for instance the “unicode” related bugs, since for each character there are multiple possible Unicode encodings. Consider, however, the ADMutate tool (http://www.ktwo.ca/c/ADMmutate-0.8.4.tar.gz), developed by the Canadian hacker “K2”: this tool enables an attacker to encrypt the shellcode of a stack-smashing buffer overflow attack [21], and to append this encrypted shellcode to the decryption algorithm. Even though most IDS now have a signature for the decryption code specific to ADMutate, it is easy to see that this principle can be applied indefinitely, in many forms.

An obvious solution would be to go back to the basics and try to implement an anomaly detection approach, modeling what is normal rather than what is anomalous. This is remarkably similar to the earliest conceptions of what an IDS should do [1]. Surprisingly enough, anomaly detection systems have been