Abstract- In this paper, a novel recursive data mining method, based on a simple but powerful model of cognition called a conceptor, is introduced and applied to computer security. The method recursively mines a string of symbols by finding frequent patterns, encoding them with unique symbols, and rewriting the string using this new coding. We apply this technique to two related and important problems in computer security: (i) masquerade detection, which aims to prevent a security attack in which an intruder impersonates a legitimate user to gain access to resources, and (ii) author identification, in which an anonymous or disputed computer session needs to be attributed to one of a set of potential authors. Many methods based on automata theory, Hidden Markov Models, Bayesian models, or even matching algorithms from bioinformatics have been proposed to solve the masquerade detection problem, but less work has been done on author identification. We used recursive data mining to characterize the structure and high-level symbols in user signatures and in the monitored sessions. We used a one-class SVM to measure the similarity of these two characterizations. For author identification, we applied a weighting prediction scheme. On the SEA dataset that we used in our experiments, the results were very promising.

Keywords- Masquerade detection, author identification, recursive data mining, one-class SVM, intrusion detection

I. INTRODUCTION

This paper focuses on two related and important topics in system security: masquerade detection and author identification. Masquerade detection is often considered the most serious and challenging problem in computer security. A masquerader hides his or her identity by impersonating a legitimate user of a computer system or network and may maliciously damage the system.
The typical ways in which masquerade attacks succeed include obtaining a legitimate user’s password, accessing an unattended and unlocked workstation, forging an email address in messages, and taking over a computer via network access. Masquerade detection is challenging for the following reasons: (i) masqueraders who enter the system as valid users cannot be detected by the existing access control or authentication mechanisms, (ii) masqueraders who perfectly mimic the user’s behavior are undetectable, and (iii) the legitimate user may be flagged as a masquerader if the user’s behavior changes.

To enable masquerade detection, a string produced by a legitimate user is collected and used to generate a signature containing some attributes (features) of this user. This signature is then compared to the attributes generated from the currently monitored string, which may have been produced by a masquerader. If normal and intrusion activities are sufficiently distinct, attributes generated from the legitimate user’s activities will be more similar to the user’s signature than those generated from the masquerader’s session. Most previous research follows this logic to distinguish the strings of legitimate users from those of masqueraders.

A problem related to masquerade detection is identifying potential internal masqueraders, which can be generalized as an author identification problem. This problem is relevant in secured environments, in which only a small number of users with known signatures can originate an attack. Other applications of author identification include finding equivalences between emails sent from differently named accounts and detecting plagiarism among papers or programs. For author identification, a string from each potential author is collected to generate a signature (a set of attributes or features). Each signature is then compared to the attributes of the currently monitored string.
The author is then decided based on the degree of similarity between the current session and each author’s signature.

In masquerade detection and author identification, the input is a string of objects (commands, packets, system calls, lines of a program execution trace, etc.) produced by a source. The task is to assess whether the monitored string conforms to the “usual” behavior of this source, in the case of intrusion detection, or which of many possible sources is the most likely producer of the monitored string, in the case of author identification. The assessment is based on the unique signature of each source, collected in a controlled experiment in which the authorship of the signature can be assured.

Recursive Data Mining for Masquerade Detection and Author Identification
Boleslaw K. Szymanski, IEEE Fellow, and Yongqiang Zhang
Department of Computer Science, RPI, Troy, NY 12180, USA
Proc. 5th IEEE System, Man and Cybernetics Information Assurance Workshop, West Point, NY, June, 2004, pp. 424-431
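As a rough illustration of the signature comparison described in this introduction, the following Python sketch builds a signature from a string of commands and compares it with a monitored session. It is only a simplified stand-in under stated assumptions: the signature here is a profile of adjacent command pairs and the similarity measure is cosine similarity, neither of which is the recursive data mining characterization or the one-class SVM used in this paper. The function names, the sample command strings, and the threshold of 0.5 are all hypothetical.

```python
from collections import Counter
from math import sqrt


def bigram_profile(commands):
    """Count adjacent command pairs; a simple stand-in for a user signature."""
    return Counter(zip(commands, commands[1:]))


def cosine_similarity(p, q):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * q[key] for key, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify_author(signatures, session):
    """Return (author, score) for the signature most similar to the session."""
    profile = bigram_profile(session)
    scores = {name: cosine_similarity(sig, profile)
              for name, sig in signatures.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def is_masquerade(signature, session, threshold=0.5):
    """Flag a session that is too dissimilar from the claimed user's signature.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return cosine_similarity(signature, bigram_profile(session)) < threshold


# Hypothetical training strings, collected while authorship is assured.
signatures = {
    "alice": bigram_profile(["ls", "cd", "ls", "vi", "make", "ls", "cd", "ls"]),
    "bob": bigram_profile(["ssh", "top", "ps", "kill", "top", "ps", "ssh", "top"]),
}

# Hypothetical monitored session of unknown origin.
monitored = ["ls", "cd", "ls", "vi", "ls", "cd"]
author, score = identify_author(signatures, monitored)
```

In this sketch, author identification picks the stored signature with the highest similarity to the monitored session, while masquerade detection checks a single claimed user's signature against a threshold. The two tasks share the same signature and similarity machinery and differ only in the final decision rule.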