© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2020
F. Iqbal et al., Machine Learning for Authorship Attribution and Cyber Forensics, International Series on Computer Entertainment and Media Technology, https://doi.org/10.1007/978-3-030-61675-5_7

Chapter 7
Authorship Characterization

The problem of authorship characterization is to determine the sociolinguistic characteristics of the potential author of a given anonymous text message. Unlike the problem of authorship attribution, where the potential suspects and their training samples are accessible for investigation, no candidate list of suspects is available in authorship characterization. Instead, the investigator is given one or more anonymous documents and is asked to identify the sociolinguistic characteristics of the potential author of the documents in question. Sociolinguistic characteristics include ethnicity, age, gender, level of education, and religion [153].

This chapter analyzes the worst-case scenario of authorship characterization, in which the data from the sample population is not extensive enough to build a classifier. The proposed approach, depicted in Fig. 7.1, first applies stylometry-based clustering to the given anonymous messages to identify the major stylistic groups. The premise is that clustering by stylometric features places all messages written by one author in the same group. A group of messages is denoted by $C_i \in \{C_1, \dots, C_n\}$. Following this, a model is trained on messages collected from the sample population. The resulting model is then used to infer the characteristics of a potential author. In this experiment, a blog dataset is used, because most bloggers voluntarily disclose personal information on their blogs, which allows their characteristics to be determined.
The selected bloggers must belong to the class categories that we want to infer in these experiments. For instance, to infer the gender of an author, we need to collect blog postings of both male and female bloggers. Next, the Writeprint of each class category of the sample population is precisely modeled by employing the concept of frequent patterns [139]. The extracted Writeprints are then used to identify the class label of each message in a cluster $C_i$. The approach is used to predict two characteristics: gender and region (or location). In the remainder of this chapter, the term "online messages" refers to e-mail messages, blog postings, and chat logs.
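
The Writeprint step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The toy feature set in `to_items`, the `min_support` threshold, the decision to keep only patterns that are frequent in one class and in no other, and the support-sum scoring rule are all assumptions made for illustration. The class labels and sample messages are hypothetical, and the clustering step is assumed to have already produced the cluster passed to `label_cluster`.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny function-word list (an assumption).
FUNCTION_WORDS = ("the", "and", "of", "to", "i", "you")


def to_items(text):
    """Map a message to a set of discrete stylometric items (toy features)."""
    words = re.findall(r"[a-z']+", text.lower())
    items = {f"uses:{w}" for w in FUNCTION_WORDS if w in words}
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    items.add("long_words" if avg_len >= 5 else "short_words")
    if "!" in text:
        items.add("exclaims")
    return frozenset(items)


def frequent_patterns(messages, min_support=0.5, max_size=2):
    """Return {pattern: support} for item combinations meeting min_support."""
    counts = Counter()
    for msg in messages:
        items = sorted(to_items(msg))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(messages)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


def build_writeprints(samples, min_support=0.5):
    """Model each class by the frequent patterns no other class shares.

    Dropping shared patterns is an assumption of this sketch.
    """
    frequent = {label: frequent_patterns(msgs, min_support)
                for label, msgs in samples.items()}
    writeprints = {}
    for label, patterns in frequent.items():
        others = set()
        for other_label, other_patterns in frequent.items():
            if other_label != label:
                others.update(other_patterns)
        writeprints[label] = {p: s for p, s in patterns.items()
                              if p not in others}
    return writeprints


def classify(message, writeprints):
    """Return the class whose Writeprint best matches, or None if none match."""
    items = to_items(message)
    scores = {label: sum(s for p, s in wp.items() if p <= items)
              for label, wp in writeprints.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def label_cluster(cluster_messages, writeprints):
    """Assign a class label to every message in one cluster."""
    return [classify(m, writeprints) for m in cluster_messages]


# Hypothetical sample population with two class categories.
samples = {
    "class_a": ["I think the game was great!",
                "The match was great and I loved it!",
                "I saw the final and it was wild!"],
    "class_b": ["Considering circumstances, participation remained surprisingly limited.",
                "Institutional arrangements frequently determine educational outcomes.",
                "Regional differences influence linguistic preferences considerably."],
}
writeprints = build_writeprints(samples)
cluster = ["I loved the show and the crowd!",
           "Comparative assessments demonstrate substantial variation."]
labels = label_cluster(cluster, writeprints)
```

In a realistic setting the items would come from a much richer stylometric feature set, and the labels assigned to the messages of a cluster could be aggregated, for example by majority vote, to characterize the cluster's author. That aggregation is likewise an assumption and is not stated in the text above.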