Intrusion detection using text processing techniques with a kernel based similarity measure

Alok Sharma a,*, Arun K. Pujari b, Kuldip K. Paliwal a
a Signal Processing Laboratory, Griffith University (Nathan Campus), Brisbane, QLD-4111, Australia
b AI Laboratory, University of Hyderabad, India

Article info
Article history: Received 24 December 2006; Accepted 16 October 2007
Keywords: Intrusion detection; kNN classifier; Similarity measure; Anomaly detection; Radial basis functions

Abstract
This paper focuses on intrusion detection based on system call sequences using text processing techniques. It introduces a kernel-based similarity measure for the detection of host-based intrusions. The k-nearest neighbour (kNN) classifier is used to classify a process as either normal or abnormal. The proposed technique is evaluated on the DARPA-1998 database and its performance is compared with that of existing techniques from the literature. The technique is shown to achieve significantly lower false positive rates than the other techniques at a 100% detection rate.
© 2007 Elsevier Ltd. All rights reserved.

1. Introduction

Intrusion detection is an important technique in the defense-in-depth network security framework and has been an active topic in computer network security in recent years. In general, intrusion detection systems (IDSs) are classified into two categories depending on the modeling method used: signature-based and behaviour-based detection. In signature-based detection, we have a collection of pre-defined descriptions, or signatures, of attacks. The current evidence is matched against these signatures to identify a possible threat. Although signature-based detection is effective in detecting known attacks, it cannot detect new attacks whose signatures are not available.
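The signature-matching idea described above can be sketched in a few lines of Python. This is only an illustrative sketch: the signature names, the system call traces, and the helpers `contains` and `match_signatures` are hypothetical, chosen for the example, and are not taken from the DARPA-1998 data or from any real IDS. A signature is assumed here to be a contiguous run of system calls.

```python
from typing import Dict, List, Sequence

# Hypothetical signatures: each maps an attack name to a system call pattern.
SIGNATURES: Dict[str, List[str]] = {
    "example_attack_a": ["open", "chmod", "exec"],
    "example_attack_b": ["socket", "connect", "exec"],
}


def contains(trace: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if pattern occurs as a contiguous run inside trace."""
    n, m = len(trace), len(pattern)
    return any(list(trace[i:i + m]) == list(pattern) for i in range(n - m + 1))


def match_signatures(
    trace: Sequence[str],
    signatures: Dict[str, List[str]] = SIGNATURES,
) -> List[str]:
    """Return the names of all signatures found in the trace."""
    return [name for name, pattern in signatures.items() if contains(trace, pattern)]


# A trace containing a known pattern, and one with no stored signature.
known = ["read", "open", "chmod", "exec", "close"]
novel = ["read", "mmap", "write", "exec", "close"]

print(match_signatures(known))  # ['example_attack_a']
print(match_signatures(novel))  # []
```

The second trace matches no stored signature and therefore goes unflagged, which illustrates why an attack without a pre-defined signature escapes this kind of detector.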
Behaviour-based detection, also known as anomaly detection, on the other hand, builds a profile of normal behaviour and attempts to identify the patterns or activities that deviate from that profile. An important feature of anomaly detection is that it can detect unknown attacks. Behaviour modeling can be done by modeling either user behaviour or process behaviour. When modeling process behaviour, system call data are one of the most common types of data used. Host-based anomaly detection systems mostly focus on system call sequences, on the assumption that a malicious activity results in an abnormal trace. Such data can be collected by logging the system calls using operating system utilities, e.g. Linux strace or the Solaris Basic Security Module (BSM). In this framework, it is assumed that normal behaviour can be profiled by a set of patterns of system call sequences, and any deviation from the normal pattern is termed an intrusion. An intrusion detection system needs to learn the normal behaviour patterns from previously collected data, which is normally accomplished by data mining or machine learning techniques. The problem of intrusion detection thus boils down to a supervised classification problem: identifying anomalous sequences that are measurably different from normal behaviour. The system call sequences of normal (as well as attack) instances are used as the training set.

* Corresponding author. E-mail address: sharma_al@usp.ac.fj (A. Sharma).
Available at www.sciencedirect.com; journal homepage: www.elsevier.com/locate/cose
0167-4048/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.cose.2007.10.003
computers & security 26 (2007) 488–495

Though anomaly-based IDS is considered better than the signature-based IDS,