Writing Style Determination Using the KNN Text Model

Oleg Granichin, Natalia Kizhaeva, Dmitry Shalymov, and Zeev Volkovich

Abstract— This paper investigates writing style using a re-sampling approach. A text is represented as a series of characters generated by distinct probability sources, and a re-sampling procedure is applied to simulate samples from the texts. To check whether the samples are generated from the same population, we use a KNN-based two-sample test. The proposed method shows a high ability to distinguish among a variety of different texts.

Index Terms— Writing style, authorship attribution, two-sample test, re-sampling.

I. INTRODUCTION

In this paper the problem of writing style determination is studied. Writing style is a set of distinctive words, grammatical structures, or any other measurable patterns that make a piece of writing unique [1]. When the author of an anonymous text has to be determined from its writing style, methods of authorship attribution (AA) are often used. AA assumes that a training set of documents with known authorship is available. The AA problem has attracted a lot of attention over the last two decades owing to its numerous applications, so the development of new computational methods for AA has become very topical. Methods of control theory can be effectively applied to the creation of new methods for data mining and computational intelligence [2].

AA techniques are used in attribution problems such as author verification (deciding whether a given text was written by a certain author) [3], plagiarism detection (assessing the similarity of two texts) [4], author profiling or characterization (providing information on the age, education, sex, etc., of the author of a given text) [5], and others.

We address the problem of writing style determination by comparing the 'randomness' of two given texts.
One of the tools reasonably applicable to this purpose is the two-sample test methodology, which checks whether two given samples are drawn from the same population. The Kolmogorov-Smirnov (KS) test is a classical approach in this case. The normal distribution cannot be confidently established as the limit distribution here, because a text written by one or more co-authors hardly appears to be generated by a single random source. To stabilize the process we apply the following multistage approach. First, we evaluate the null hypothesis distribution, under which the considered writing styles coincide, by comparing samples drawn from a single text. Next, samples drawn separately from different texts are compared in order to obtain p-values calculated with respect to the constructed null hypothesis distribution. If the writing styles are identical, these p-values are uniformly distributed on the interval [0, 1]. We compare the distribution of the obtained p-values with the uniform one by means of a univariate two-sample KS test.

O. Granichin, N. Kizhaeva, and D. Shalymov are with the Saint Petersburg State University (Faculty of Mathematics and Mechanics, and Research Laboratory for Analysis and Modeling of Social Processes), 198504, Universitetskii pr. 28, St. Petersburg, Russia. O. Granichin is also with the Institute of Problems in Mechanical Engineering, Russian Academy of Sciences. Z. Volkovich is with the Dept. Soft. Engn., The Ort Braude, Karmiel, Israel. oleg granichin@mail.ru, natalia.kizhaeva@gmail.com, shalydim@mail.com, zeev53@mail.ru

This work was financially supported by the SPbSU (projects 15.61.2219.2013 and 6.37.181.2014) and, in part, by the RFBR (grants 13-07-00250 and 15-08-02640).

The article is organized as follows. Section II reviews related work. Section III provides an overview of the two-sample test methodology. Section IV presents the sampling and comparison algorithms.
The results of numerical experiments are shown in Section V, followed by conclusions and a discussion of future work.

II. RELATED WORK

One of the most influential works in this field is that of Mosteller and Wallace (1964) [6], in which the authorship of 'The Federalist Papers' (a collection of 85 articles and essays written by Alexander Hamilton, James Madison, and John Jay) was assigned on the basis of a Bayesian statistical analysis of common word frequencies. It opened up the field to the exploration of new types of stylometric features and new modeling techniques [1].

Since then, various textual measurements have been proposed. The simplest ones come from common descriptive statistics: average word length, relative frequency, number of words per sentence, and the distribution of parts of speech can be easily calculated. They are obtainable for any natural language and corpus (if a proper tokenizer is available) and are useful for evaluating writing style. However, none of these features robustly separates different authors in a large number of cases [9].

The next level of measures is character features, according to which a text is viewed as a sequence of characters [1]. These measures include counts of alphabetic and digit characters, counts of uppercase and lowercase characters, letter frequencies, punctuation mark counts, and so on [10]. A more sophisticated approach is to explore the frequencies of character n-grams. Among the advantages of this approach are the ability to capture context, the use of punctuation, and tolerance to grammatical errors.

2015 IEEE International Symposium on Intelligent Control (ISIC)
Part of 2015 IEEE Multi-Conference on Systems and Control
September 21-23, 2015. Sydney, Australia
978-1-4799-7798-7/15/$31.00 ©2015 IEEE
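As a minimal sketch of the character n-gram frequencies discussed in Section II, the following Python function computes the relative frequency of every overlapping n-gram in a string. The function name `char_ngram_frequencies`, the default n = 3, and the sample sentence are illustrative assumptions rather than details taken from the paper. No lowercasing or other normalization is applied, so spaces and punctuation marks are counted as ordinary characters.

```python
from collections import Counter


def char_ngram_frequencies(text: str, n: int = 3) -> dict[str, float]:
    """Return the relative frequency of each overlapping character n-gram.

    Hypothetical helper for illustration only; it is not the feature
    extractor used by the authors.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    total = len(text) - n + 1  # number of overlapping n-grams in the text
    if total <= 0:
        return {}  # text is shorter than n, so it contains no n-grams
    counts = Counter(text[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


# Illustrative input: spaces and the comma become part of some trigrams.
profile = char_ngram_frequencies("the cat, the hat", n=3)
```

Because the n-grams overlap and include non-letter characters, a profile of this kind reflects punctuation habits and local context without requiring a tokenizer. Two texts could then be compared through their profiles, although the choice of comparison measure is left open here.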