Intrusion detection using text processing techniques with a kernel based similarity measure

Alok Sharma a,*, Arun K. Pujari b, Kuldip K. Paliwal a

a Signal Processing Laboratory, Griffith University (Nathan Campus), Brisbane, QLD-4111, Australia
b AI Laboratory, University of Hyderabad, India

Article info
Article history: Received 24 December 2006; Accepted 16 October 2007
Keywords: Intrusion detection; kNN classifier; Similarity measure; Anomaly detection; Radial basis functions

Abstract
This paper focuses on intrusion detection based on system call sequences using text processing techniques. It introduces a kernel-based similarity measure for the detection of host-based intrusions. The k-nearest neighbour (kNN) classifier is used to classify a process as either normal or abnormal. The proposed technique is evaluated on the DARPA-1998 database and its performance is compared with that of other existing techniques available in the literature. It is shown that this technique is significantly better than the other techniques in achieving lower false positive rates at a 100% detection rate.
© 2007 Elsevier Ltd. All rights reserved.

1. Introduction

Intrusion detection is an important technique in the defense-in-depth network security framework and has been a hot topic in computer network security in recent years. In general, intrusion detection systems (IDSs) are classified into two categories depending on the modeling method used: signature-based and behaviour-based detection. In signature-based detection, we have a collection of pre-defined descriptions, or signatures, of attacks. The current evidence is matched against these signatures to identify a possible threat. Although signature-based detection is effective in detecting known attacks, it cannot detect new attacks whose signatures are not yet available.
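As a rough illustration of the matching step described above, the following minimal sketch assumes that a signature is simply a contiguous subsequence of system calls. This is a simplification made for the example; the signature names, call patterns and function names are hypothetical and are not taken from the paper or from any real signature database.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical signature database: the names and call patterns are invented
# purely for illustration and do not describe real attacks.
SIGNATURES: Dict[str, List[str]] = {
    "example-attack-a": ["open", "chmod", "exec"],
    "example-attack-b": ["fork", "setuid", "exec"],
}


def contains(trace: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if `pattern` occurs as a contiguous run inside `trace`."""
    n, m = len(trace), len(pattern)
    return any(list(trace[i:i + m]) == list(pattern) for i in range(n - m + 1))


def match_signature(
    trace: Sequence[str],
    signatures: Dict[str, List[str]] = SIGNATURES,
) -> Optional[str]:
    """Return the name of the first signature found in `trace`, else None."""
    for name, pattern in signatures.items():
        if contains(trace, pattern):
            return name
    return None


# A trace that embeds a known pattern is flagged by name.
known_trace = ["read", "open", "chmod", "exec", "close"]
# A trace with no matching signature passes undetected, whether or not it is malicious.
novel_trace = ["open", "mmap", "exec"]

print(match_signature(known_trace))
print(match_signature(novel_trace))
```

The second trace shows the limitation noted above: anything without a stored signature is reported as no match, regardless of whether it is actually benign.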
Behaviour-based detection, also known as anomaly detection, on the other hand, builds a profile of normal behaviour and attempts to identify the patterns or activities that deviate from the normal profile. An important feature of anomaly detection is that it can detect unknown attacks. Behaviour modeling can be done by modeling either user behaviour or process behaviour. When modeling process behaviour, system call data are one of the most common types of data used. Host-based anomaly detection systems mostly focus on system call sequences, under the assumption that a malicious activity results in an abnormal trace. Such data can be collected by logging the system calls using operating system utilities, e.g. Linux strace or the Solaris Basic Security Module (BSM). In this framework, it is assumed that normal behaviour can be profiled by a set of patterns of system call sequences. Any deviation from the normal pattern is termed an intrusion. An intrusion detection system needs to learn the normal behaviour patterns from previously collected data, and this is normally accomplished by data mining or machine learning techniques. The problem of intrusion detection thus boils down to a supervised classification problem: identifying anomalous sequences that are measurably different from the normal behaviour. The system call sequences of normal (as well as attack) instances are used as the training set.

* Corresponding author. E-mail address: sharma_al@usp.ac.fj (A. Sharma).
Computers & Security 26 (2007) 488–495. doi:10.1016/j.cose.2007.10.003

Though anomaly-based IDS is considered better than the signature-based IDS,