Swarm Intelligence Based Author Identification for Digital Typewritten Text

Abdul Rauf Baig
Dept. of Information Systems, College of Computer & Information Sciences, Al Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
raufbaig@ccis.imamu.edu.sa

Hassan Mujtaba Kayani
Dept. of Computer Science, National University of Computer & Emerging Sciences (NUCES), Islamabad, Pakistan
hasan.mujtaba@nu.edu.pk

Abstract: In this study we report our research on learning an accurate and easily interpretable classifier model for authorship classification of typewritten digital texts. For this purpose we use Ant Colony Optimization, a meta-heuristic based on swarm intelligence. Unlike black-box classifiers, the decision-making rules produced by the proposed method are understandable by people familiar with the domain and can be easily enhanced with the addition of domain knowledge. Our experimental results show that the method is feasible and more accurate than decision trees.

Keywords: digital forensics; author identification; author verification; data mining; machine learning

I. INTRODUCTION

With the rapid proliferation of computers and networking during the last 20 years, our society has changed. Previously unknown issues have become a major concern. Cybercrime is one such phenomenon. A crime committed with the help of computers and networking is called a cybercrime. The definition also includes crimes that target the computer or network. Common examples of cybercrimes are fraud, identity theft, phishing, network intrusion, denial-of-service attacks, and the spreading of computer viruses.

One of the main causes of the rapid spread of cybercrime is the anonymity of Internet users. Internet usage did not evolve with user identification as a concern. Unlike the users of many other commonly used systems (e.g. banking, electricity, international travel), individuals need not divulge their real identity in cyberspace.
Until our society is ready for disciplinary measures, such as a unique identity for each person accessing the Internet (transmitted with each piece of communication), we have to find algorithmic ways of establishing the identity of Internet users, particularly in situations where a cybercrime has been committed or is being perpetrated. The present research is focused on finding a method that can be used to detect the authorship of electronic text in a reliable and comprehensible (as opposed to black-box) way.

The key to authorship identification is hidden in the way different people write. It is commonly observed that people can detect when a friend suddenly writes in a manner that is not his usual custom. Every person has a writing style that is usually stable across different specimens of his writing. This style can be represented by features such as the choice and use of words, the organization of sentences and paragraphs, and the use of function words. These features can then be used for authorship identification.

In the current paper, we present a procedure for using these features to learn a classifier model and then use this model to identify the authors of unseen input text. Our classifier is based on a meta-heuristic known as Ant Colony Optimization (ACO). This meta-heuristic comes under the aegis of swarm intelligence. The main advantages of such a classifier are accuracy and comprehensibility. Unlike black-box classifiers, e.g. SVM, the decision-making rules produced and used by an ACO classifier are understandable by domain experts.

The remainder of the paper is structured as follows. In Section 2, we present the fundamental concepts associated with authorship identification and give an overview of related work. Section 3 outlines our proposed approach. The details of the dataset and feature set used in the experiments are given in Section 4. In the same section we present the experimental results and discuss them.
In Section 5, we conclude the present research effort and present future research directions.

II. FUNDAMENTALS AND RELATED WORK

Classification is one of the basic tasks in data mining [1]. Some of the popular and commonly used classification methods are decision trees, artificial neural networks, support vector machines (SVM), and k-nearest neighbor classifiers. Classification of text on the basis of its author (in other words, author identification) is the determination of the writer, given a closed set of writers and a sample of text. This area comes under the umbrella of stylometric studies [2].

978-1-4799-7620-1/15/$31.00 ©2015 IEEE
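
As a rough illustration of the closed-set formulation of author identification, the sketch below maps each text to a few stylometric features (function-word frequencies, mean word length, mean sentence length) and assigns an unseen text to the candidate author whose feature centroid is nearest. This is not the ACO-based classifier proposed in this paper. The nearest-centroid rule, the ten-word FUNCTION_WORDS list, the toy SAMPLES corpus, and all function names are assumptions made only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (assumption);
# stylometric studies typically use far larger lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "that", "but", "i", "it")


def style_features(text):
    """Map a text to a vector of simple stylometric features."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    counts = Counter(words)
    # Relative frequency of each function word.
    features = [counts[w] / n_words for w in FUNCTION_WORDS]
    # Mean word length in characters.
    features.append(sum(len(w) for w in words) / n_words)
    # Mean sentence length in words.
    features.append(len(words) / max(len(sentences), 1))
    return features


def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def identify_author(text, samples_by_author):
    """Closed-set identification: return the candidate with the nearest centroid.

    Features are left unscaled here, so mean sentence length dominates the
    Euclidean distance; a real system would normalize the features first.
    """
    query = style_features(text)
    profiles = {
        author: centroid([style_features(t) for t in texts])
        for author, texts in samples_by_author.items()
    }
    return min(profiles, key=lambda author: math.dist(query, profiles[author]))


# Toy corpus (invented for this sketch): two candidate authors, two samples each.
SAMPLES = {
    "alice": [
        "I think that it is fine. I like it. I do.",
        "It is good. I see it. I know.",
    ],
    "bob": [
        "The analysis of the results in the appendix shows that the error of "
        "the model and the error of the baseline differ to a degree that "
        "matters in the end.",
        "The report of the committee and the notes of the chairman point to a "
        "delay in the delivery of the parts that the plant needs.",
    ],
}

if __name__ == "__main__":
    print(identify_author("I see. It is fine. I like that.", SAMPLES))
```

A nearest-centroid rule is used only because it fits in a few lines. Like the black-box classifiers mentioned above, it does not yield the human-readable decision rules that motivate the ACO approach.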