International Journal of Computer Applications (0975 – 8887)
Volume 86 – No 12, January 2014

Authorship Analysis Studies: A Survey

Sara El Manar El Bouanani
ENSIAS - Mohammed V Souissi University
Mohammed Ben Abdallah Regragui, Madinat Al Irfane, BP 713, Agdal Rabat, Morocco

Ismail Kassou
ENSIAS - Mohammed V Souissi University
Mohammed Ben Abdallah Regragui, Madinat Al Irfane, BP 713, Agdal Rabat, Morocco

ABSTRACT
The objective of this paper is to provide a review of the different studies done on authorship analysis. The focus is on outlining the stylometric features that allow distinguishing between authors and on listing the diverse techniques used to classify an author’s texts.

General Terms
Authorship analysis

Keywords
Authorship characterization, authorship attribution, similarity detection, stylometric features, probabilistic models, compression models, machine learning classifiers, clustering algorithms, inter-textual distance.

1. INTRODUCTION
Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions on its authorship. Authorship analysis has its roots in a linguistic research area called stylometry, which refers to the statistical analysis of literary style [26]. Authorship analysis measures certain textual features in order to distinguish between texts written by different authors. The first studies date back to the 19th century, with a study of Shakespeare’s plays in 1887 [49], followed by statistical studies in the first half of the 20th century ([58], [59], [61]). The study of the Federalist Papers [50] in 1964 is considered the most influential work in authorship attribution [56].

Authorship analysis studies can be classified into three categories ([1], [24] and [26]):
• Authorship attribution or identification determines the likelihood of a particular author having written a piece of work by examining other works produced by that author.
• Authorship profiling or characterization determines the author’s profile, that is, the characteristics of the author who produced a given piece of work. These characteristics include gender, educational background, cultural background and language familiarity.
• Similarity detection compares multiple pieces of work and determines whether or not they were produced by a single author, without necessarily identifying the author. Similarity detection is often used in the context of plagiarism detection, where plagiarism involves the complete or partial replication of a piece of work with or without permission of the original author.

Authorship analysis has been used in a diverse range of application areas. Many previous authorship studies focused on analyzing literary texts ([18], [19], [33], [40] and [50]), program code ([27], [37] and [45]) and online messages ([1], [8], [11], [22] and [25]). However, with the growth of web applications and social networks, studies in the last decade have focused on analyzing online messages (e-mails, blogs, forums…) rather than literary texts [56].

The purpose of this paper is therefore to present a survey of the studies and the techniques used in the field of authorship analysis. The paper is organized as follows. In Section 2, the different areas of authorship analysis are described. Section 3 outlines the stylometric features which are used to distinguish one text from another. Section 4 provides a survey of the different techniques used to detect authors. In Section 5, a brief conclusion is given.

2. AUTHORSHIP ANALYSIS

2.1 Authorship attribution
Authorship attribution is particularly concerned with the identification of the real author of a disputed anonymous document. In the literature, authorship identification is treated as a text categorization or text classification problem. The process starts with data cleaning, followed by feature extraction and normalization.
Each suspected document is converted into a feature vector [42], with the suspect as the class label. Feature values are computed from stylometric features. The resulting feature vectors are divided into two groups: a training set and a testing set. The training set is used to develop a classification model, whereas the testing set is used to validate the developed model by treating its class labels as unknown. Common classifiers include decision trees, neural networks and Support Vector Machines (SVM) [42].

Authorship attribution studies differ in terms of the stylometric features used and the type of classifiers employed. References [30] and [47] describe two approaches which attempt to mine e-mail authorship for the purpose of computer forensics. The authors extract various e-mail document features, including linguistic features, header features, linguistic patterns and structural characteristics. All these features are used with the SVM learning algorithm to attribute e-mail messages to an author.

Reference [26] develops a framework for authorship identification in online messages to deal with the identity-tracing problem. In this framework, four types of writing style features (lexical, syntactic, structural and content-specific features) are extracted from English and Chinese online newsgroup messages. Three classification techniques are compared: decision trees, SVM and back-propagation neural networks. Experimental results showed that this framework is able to identify authors with a satisfactory accuracy of 70 to 95%, and that the SVM classifier outperformed the other two.

Reference [60] uses only function words and applies five classifiers (Naive Bayes, Bayesian networks, nearest-neighbor method, decision trees and SVM). The data analyzed is a collection of newswire articles from the AP (Associated Press) sub-collection.
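
The attribution pipeline described above (feature extraction, normalization, a training/testing split and a classifier) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited study: the function-word list, the toy texts and the helper names (`feature_vector`, `train`, `predict`, `accuracy`) are hypothetical, and a simple nearest-neighbor rule stands in for the SVM, decision tree and neural network classifiers used in the literature.

```python
import math
import re
from collections import Counter

# Hypothetical, very small function-word list chosen for illustration only.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "is", "it", "but"]


def feature_vector(text):
    """Return the relative frequency of each function word in the text.

    Dividing by the token count is a simple form of normalization, so that
    long and short documents become comparable.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def train(labeled_texts):
    """For a nearest-neighbor rule, training just stores (vector, author) pairs."""
    return [(feature_vector(text), author) for text, author in labeled_texts]


def predict(model, text):
    """Attribute the text to the author of the closest training vector."""
    vector = feature_vector(text)
    return min(model, key=lambda pair: euclidean(pair[0], vector))[1]


def accuracy(model, test_set):
    """Fraction of test texts attributed to their true author."""
    correct = sum(predict(model, text) == author for text, author in test_set)
    return correct / len(test_set)


# Toy data (assumed): the author is the class label of each document.
training_set = [
    ("The history of the city and the fall of the empire", "A"),
    ("It is late but it is warm and it is quiet", "B"),
]
test_set = [
    ("The end of the war in the north of the country", "A"),
    ("It is cold but it is bright", "B"),
]

model = train(training_set)
print("accuracy on the held-out texts:", accuracy(model, test_set))
```

The test labels are used only to score the predictions, which mirrors the validation step in which the class labels of the testing set are treated as unknown. Replacing `predict` with a different classifier leaves the rest of the pipeline unchanged, which is why attribution studies can be compared along two separate axes: the features extracted and the classifier applied.
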

2.2 Authorship Characterization
Authorship characterization is used to detect sociolinguistic attributes such as gender, age, occupation and educational level of the potential author of an anonymous document [42].