Swarm Intelligence Based Author Identification for
Digital Typewritten Text
Abdul Rauf Baig
Dept. of Information Systems,
College of Computer & Information Sciences,
Al Imam Mohammad Ibn Saud Islamic
University (IMSIU), Riyadh, Saudi Arabia
raufbaig@ccis.imamu.edu.sa
Hassan Mujtaba Kayani
Dept. of Computer Science,
National University of Computer & Emerging Sciences
(NUCES)
Islamabad, Pakistan
hasan.mujtaba@nu.edu.pk
Abstract—In this study we report our research on learning an
accurate and easily interpretable classifier model for authorship
classification of typewritten digital texts. For this purpose we use
Ant Colony Optimization, a meta-heuristic based on swarm
intelligence. Unlike black box type classifiers, the decision
making rules produced by the proposed method are
understandable by people familiar with the domain and can be
easily enhanced with the addition of domain knowledge. Our
experimental results show that the method is feasible and more
accurate than decision trees.
Keywords—digital forensic; author identification; author
verification; data mining; machine learning
I. INTRODUCTION
With the rapid proliferation of computers and networking
during the last 20 years, our society has changed. Previously
unknown issues have become a major concern. Cybercrime is
one such phenomenon. A crime committed with the help of
computers and networking is called a cybercrime. The
definition also includes those crimes that target the computer or
network. Common examples of cybercrimes are fraud, identity
theft, phishing, network intrusion, denial-of-service attacks,
and spreading of computer viruses. One of the main causes of
the rapid spread of cybercrimes is the anonymity of Internet
users. Internet usage did not evolve with user identification
as a concern. Unlike the users of many other commonly used
systems (e.g. banking, electricity, international travel),
individuals need not divulge their real identity in
cyberspace. Until our society is ready for disciplinary
measures, such as a unique identity for each person accessing the
Internet (transmitted with each piece of communication), we
have to find algorithmic ways of establishing the identity of
Internet users (particularly in situations where a cybercrime
has been committed or is being perpetrated).
The present research is focused on finding a method that
can be used to detect the authorship of electronic text in a
reliable and comprehensible (as opposed to black box) way.
The key to authorship identification is hidden in the way
different people write. It is a commonly observed phenomenon
that people can detect when a friend suddenly writes in a
manner that departs from his usual custom. Every person has a
writing style that is usually stable across different specimens
of his writing. This style can be represented by features such as
choice and use of words, organization of sentences and
paragraphs, and use of function words. These features can then
be used for authorship identification.
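As an illustration of how such style features might be computed from raw text, the following is a minimal Python sketch. The function name `style_features`, the short `FUNCTION_WORDS` list, and the particular features chosen are assumptions made for this example only; they are not the feature set used in our experiments.

```python
import re
from collections import Counter

# Small illustrative list (an assumption); a real study would use a
# much larger set of function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "that", "is")

def style_features(text):
    """Return a dict of simple stylometric features for one text sample."""
    # Split into sentences on '.', '!' or '?' and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-case word tokens (letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words) or 1  # avoid division by zero on empty input
    features = {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "vocabulary_richness": len(counts) / n_words,  # type/token ratio
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / n_words
    return features
```

Calling `style_features(sample_text)` on each specimen yields one numeric feature vector per text, which is the kind of input a classifier can learn from.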
In the current paper, we present a procedure for using these
features to learn a classifier model and then using this model
to identify the authors of unseen input text. Our classifier is
based on a meta-heuristic known as Ant Colony Optimization
(ACO). This meta-heuristic comes under the aegis of swarm
intelligence. The main advantages of such a classifier are
accuracy and comprehensibility. Unlike black box type
classifiers, e.g. SVM, the decision making rules produced and
used by an ACO classifier are understandable by experts of the
domain.
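To make the notion of comprehensibility concrete, the sketch below shows how an ordered list of IF-THEN rules could be applied to a feature vector. It assumes that rules take the form of threshold conditions on style features; the rules, thresholds, feature names, and author labels are all hypothetical, and the ACO search that would discover such rules is not shown.

```python
# Each rule: (list of conditions, predicted author). A condition is
# (feature name, operator, threshold). All values here are hypothetical.
RULES = [
    ([("avg_sentence_length", ">", 20.0), ("freq_the", "<=", 0.05)], "author_A"),
    ([("avg_word_length", "<=", 4.2)], "author_B"),
]
DEFAULT_AUTHOR = "author_C"  # predicted when no rule fires

def condition_holds(features, condition):
    """Check a single (name, operator, threshold) condition."""
    name, op, threshold = condition
    value = features[name]
    return value > threshold if op == ">" else value <= threshold

def classify(features):
    """Return the author of the first rule whose conditions all hold."""
    for conditions, author in RULES:
        if all(condition_holds(features, c) for c in conditions):
            return author
    return DEFAULT_AUTHOR
```

Because every prediction can be traced to one readable rule, a domain expert can inspect the rule list, question individual thresholds, or add rules by hand, which is not possible with a black box model.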
The remainder of the paper is structured as follows.
In Section 2, we present the fundamental concepts associated
with authorship identification and give an overview of related
work. Section 3 outlines our proposed approach. The details of
the dataset and feature set used in the experiments are given in
Section 4, where we also present and discuss the experimental
results. Section 5 concludes the paper and presents future
research directions.
II. FUNDAMENTALS AND RELATED WORK
Classification is one of the basic tasks in data mining [1].
Some of the popular and commonly used classification
methods are decision trees, artificial neural networks, support
vector machines (SVM), and k-nearest neighbor classifiers.
Classification of text on the basis of its author (in other
words, author identification) is the determination of the writer,
given a closed set of candidate writers and a sample of text. This area
comes under the umbrella of stylometric studies [2].
978-1-4799-7620-1/15/$31.00 ©2015 IEEE