© The Editor(s) (if applicable) and The Author(s), under exclusive license to
Springer Nature Switzerland AG 2020
F. Iqbal et al., Machine Learning for Authorship Attribution and Cyber
Forensics, International Series on Computer Entertainment and Media
Technology, https://doi.org/10.1007/978-3-030-61675-5_7
Chapter 7
Authorship Characterization
The problem of authorship characterization is to determine the sociolinguistic
characteristics of the potential author of a given anonymous text message. Unlike
authorship attribution, where the potential suspects and their training samples are
accessible for investigation, authorship characterization has no candidate list of
suspects. Instead, the investigator is given one or more anonymous documents and is
asked to identify the sociolinguistic characteristics of their potential author.
Sociolinguistic characteristics include ethnicity, age, gender, level of education,
and religion [153].
This chapter analyzes the worst-case scenario of authorship characterization, in
which the data from the sample population is not extensive enough to build a
classifier. The proposed approach, depicted in Fig. 7.1, first applies
stylometry-based clustering to the given anonymous messages to identify major
stylistic groups. The premise is that clustering by stylometric features places all
messages written by one author in the same group. A group of messages is denoted by
$C_i \in \{C_1, \ldots, C_n\}$. Next, a model is trained on messages collected from
the sample population. The resulting model is then used to infer the characteristics
of the potential author.
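The clustering step can be illustrated with a minimal sketch. The passage above does not fix the stylometric features or the clustering algorithm, so everything named here is an assumption made for illustration: the small feature set in `stylometric_vector`, the `FUNCTION_WORDS` list, the use of plain k-means, and the toy messages. Features are left unscaled for brevity; a real feature set would normally be normalized first.

```python
import math
import re
from collections import Counter

# Illustrative subset of function words (an assumption, not the chapter's feature set).
FUNCTION_WORDS = ("the", "of", "and", "to", "i", "you")

def stylometric_vector(text):
    """Map a message to a small vector of style features (hypothetical feature set)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = max(len(tokens), 1)
    n_chars = max(len(text), 1)
    counts = Counter(tokens)
    features = [
        sum(len(t) for t in tokens) / n_tokens,      # average word length
        sum(c.isupper() for c in text) / n_chars,    # uppercase ratio
        sum(c in "!?.,;:" for c in text) / n_chars,  # punctuation ratio
    ]
    # Relative frequency of each function word.
    features += [counts[w] / n_tokens for w in FUNCTION_WORDS]
    return features

def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids so runs are deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy anonymous messages in two visibly different styles (invented for this sketch).
messages = [
    "HEY!!! CALL ME NOW!!!",
    "I would appreciate it if you could return the documents to the office.",
    "WHERE R U??? COME NOW!!!",
    "The committee reviewed the proposal and the budget of the department.",
]
labels = kmeans([stylometric_vector(m) for m in messages], k=2)
```

Each distinct value in `labels` corresponds to one group $C_i$. The number of groups `k` has to be chosen by the investigator in this sketch, since nothing here estimates it from the data.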
The experiments use a blog dataset, because most bloggers voluntarily disclose
personal information on their blogs, which allows their characteristics to be
determined. The selected bloggers must cover the class categories to be inferred.
For instance, to infer the gender of an author, blog postings of both male and
female bloggers are collected. Next, the Writeprint of each class category of the
sample population is modeled using the concept of frequent patterns [139]. The
extracted Writeprints are then used to identify the class label of each message in
a cluster $C_i$. The approach is used to predict two characteristics: gender, and
region or location. In the remainder of this chapter, the term “online messages”
refers to e-mail messages, blog postings, and chat logs.
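The Writeprint idea can be sketched as follows, under stated assumptions. The passage does not give the mining algorithm, the support threshold, or the scoring rule, so this sketch makes its own choices: each message is already reduced to a set of discretized feature items (the item names such as `wordlen=high` are hypothetical), frequent patterns are found by brute-force enumeration rather than an Apriori-style miner, patterns frequent in every class are discarded, and a cluster takes the majority label of its messages.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(item_sets, min_support=0.5, max_size=2):
    """Return {pattern: support} for itemsets found in at least min_support of the messages."""
    counts = Counter()
    for items in item_sets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    n = len(item_sets)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def build_writeprints(samples_by_class, min_support=0.5):
    """One Writeprint per class; patterns frequent in every class are dropped (an assumption)."""
    prints = {
        label: frequent_patterns(sets, min_support)
        for label, sets in samples_by_class.items()
    }
    shared = set.intersection(*(set(p) for p in prints.values()))
    return {
        label: {p: s for p, s in pats.items() if p not in shared}
        for label, pats in prints.items()
    }

def classify_message(items, writeprints):
    """Score each class by the summed support of its patterns contained in the message."""
    scores = {
        label: sum(s for p, s in pats.items() if p <= items)
        for label, pats in writeprints.items()
    }
    return max(scores, key=scores.get)

def label_cluster(cluster, writeprints):
    """Assign a cluster the majority class label of its messages."""
    votes = Counter(classify_message(items, writeprints) for items in cluster)
    return votes.most_common(1)[0][0]

# Toy sample population with two hypothetical class categories.
samples = {
    "class_a": [
        {"wordlen=high", "punct=low", "fw_the=high"},
        {"wordlen=high", "punct=low", "fw_the=low"},
        {"wordlen=high", "punct=high", "fw_the=high"},
    ],
    "class_b": [
        {"wordlen=low", "punct=high", "fw_the=low"},
        {"wordlen=low", "punct=high", "fw_the=high"},
        {"wordlen=low", "punct=low", "fw_the=low"},
    ],
}
writeprints = build_writeprints(samples)

# One cluster of anonymous messages, already discretized the same way.
cluster = [
    {"wordlen=low", "punct=high", "fw_the=high"},
    {"wordlen=low", "punct=high", "fw_the=low"},
    {"wordlen=high", "punct=high", "fw_the=low"},
]
cluster_label = label_cluster(cluster, writeprints)
```

Summing pattern supports is only one plausible scoring rule. It favors classes with many matching patterns, so a normalized score may be preferable when the Writeprints differ greatly in size.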