Writing Style Determination Using the KNN Text Model
Oleg Granichin, Natalia Kizhaeva, Dmitry Shalymov, and Zeev Volkovich
Abstract— The aim of the paper is writing style investigation.
The method used is based on a re-sampling approach. We
represent a text as a series of characters generated by distinct
probability sources. A re-sampling procedure is applied in order
to simulate samples from the texts. To check whether the samples
are generated from the same population, we use a KNN-based
two-sample test. The proposed method shows a high ability to
distinguish between a variety of different texts.
Index Terms— Writing style, authorship attribution, two-
sample test, re-sampling.
I. INTRODUCTION
In this paper the problem of writing style determination is
studied. Writing style is a set of distinctive words, various
grammatical structures, or any other measurable pattern that
makes the piece of writing unique [1]. When the author of an
anonymous text has to be determined from the writing style,
methods of authorship attribution (AA) are often used. AA
assumes that a training set of documents with known
authorship is available.
The problem of AA has attracted a lot of attention over the last
two decades due to its numerous applications. Therefore, the
development of new computational methods for AA has
become very topical. The methods of control theory can be
effectively applied to the creation of new methods for data
mining and computational intelligence [2].
The AA techniques find use in such attribution problems
as author verification (i.e., to decide whether a given text
was written by a certain author) [3], plagiarism detection
(i.e., to assess similarity of two texts) [4], author profiling
or characterization (i.e., to provide information on the age,
education, sex, etc., of the author of a given text) [5] and
others.
We address the problem of writing style determination
using a comparison of the ’randomness’ of two given texts.
One tool that is well suited to this purpose is the two-sample
test methodology, which is intended to check whether two given
samples are drawn from the same population. The
Kolmogorov-Smirnov (KS) test is a classical approach
for this case.
O. Granichin, N. Kizhaeva, and D. Shalymov are with the Saint
Petersburg State University (Faculty of Mathematics and Mechanics, and
Research Laboratory for Analysis and Modeling of Social Processes),
198504, Universitetskii pr. 28, St. Petersburg, Russia. O. Granichin is
also with the Institute of Problems in Mechanical Engineering, Russian
Academy of Sciences. Z. Volkovich is with the Dept. Soft. Engn., The
Ort Braude, Karmiel, Israel. oleg granichin@mail.ru,
natalia.kizhaeva@gmail.com,
shalydim@mail.com, zeev53@mail.ru
This work was financially supported by the SPbSU (project
15.61.2219.2013 and 6.37.181.2014) and, in part, by the RFBR (grants 13-
07-00250 and 15-08-02640).
The normal distribution cannot be confidently established
as the limiting one here because a text written by one or
more co-authors hardly appears to be generated by a single
random source. To stabilize the process, we apply the
following multistage approach. At the first step, we evaluate
the null hypothesis distribution, assuming that the considered
writing styles coincide, by comparing samples drawn from
the text. Next, samples drawn separately from different
texts are matched in order to obtain the appropriate p-values,
calculated with respect to the constructed null hypothesis
distribution. If the writing styles are identical, these
p-values are uniformly distributed on the interval [0, 1]. We
compare the distribution of the obtained p-values with the
uniform one by means of a univariate two-sample KS-test.
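The three stages above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure evaluated in the paper: the statistic `profile_distance` (an L1 distance between character frequency profiles) is a hypothetical stand-in for the KNN-based two-sample statistic, samples are assumed to be random contiguous fragments, and the final stage measures the KS distance between the empirical p-value distribution and the Uniform[0, 1] distribution function directly instead of running a two-sample test. All function names are illustrative.

```python
import random
from collections import Counter


def draw_sample(text, length, rng):
    """Draw a random contiguous fragment of `length` characters (assumed scheme)."""
    start = rng.randrange(len(text) - length + 1)
    return text[start:start + length]


def profile_distance(a, b):
    """Placeholder statistic (NOT the KNN statistic): L1 distance between
    the character frequency profiles of two samples."""
    fa, fb = Counter(a), Counter(b)
    return sum(abs(fa[c] / len(a) - fb[c] / len(b)) for c in set(fa) | set(fb))


def null_distribution(text, length, n_pairs, rng):
    """Stage 1: statistic values for pairs of samples drawn from one text."""
    return sorted(
        profile_distance(draw_sample(text, length, rng),
                         draw_sample(text, length, rng))
        for _ in range(n_pairs)
    )


def p_values(text_a, text_b, null, length, n_pairs, rng):
    """Stage 2: empirical p-values of cross-text statistics under the null."""
    result = []
    for _ in range(n_pairs):
        d = profile_distance(draw_sample(text_a, length, rng),
                             draw_sample(text_b, length, rng))
        # Fraction of null values that are at least as large as d.
        result.append(sum(v >= d for v in null) / len(null))
    return result


def ks_distance_to_uniform(values):
    """Stage 3: KS distance between the empirical distribution function
    of `values` and the Uniform[0, 1] distribution function."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
```

A large distance in the last stage suggests that the p-values are far from uniform, that is, that the two texts differ in style under the chosen statistic. Because the placeholder statistic takes only finitely many values, the p-values are not exactly uniform even for identical styles, so any decision threshold would have to be calibrated separately.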
The article is organized as follows. Section II reviews
related work. Section III provides an overview of the two-sample
test methodology. Section IV presents the sampling
and comparing algorithms. The results of numerical experiments
are shown in Section V, followed by conclusions and
a discussion of future work.
II. RELATED WORK
One of the most influential works in this field is the work
of Mosteller and Wallace (1964) [6] where the authorship
of ’The Federalist Papers’ (a collection of 85 articles and
essays written by Alexander Hamilton, James Madison, and
John Jay) was assigned based on Bayesian statistical analysis
of common word frequencies. It opened up the field to the
exploration of new types of stylometric features and new
modeling techniques [1].
Since then, various textual measurements have been proposed.
The simplest ones come from common descriptive statistics.
Average word lengths, relative frequency, number of words
per sentence, and distribution of parts of speech can be easily
calculated. They are obtainable for any natural language and
corpus (if a proper tokenizer is available), and are useful for
evaluating the writing style. However, none of these features
robustly separates different authors in a large number of cases
[9].
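As an illustration of how cheaply such descriptive statistics can be obtained, the following Python sketch computes two of them. The regular-expression tokenizer and sentence splitter are simplifying assumptions made for this example (ASCII letters and apostrophes form words; '.', '!' and '?' end sentences), and the function name `descriptive_features` is hypothetical. A non-empty input with at least one word is assumed.

```python
import re


def descriptive_features(text):
    """Average word length and words per sentence for a non-empty text."""
    # Naive sentence splitter: runs of '.', '!' or '?' end a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenizer: a word is a run of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "words_per_sentence": len(words) / len(sentences),
    }
```

For other languages or corpora, the two regular expressions would have to be replaced by a proper tokenizer.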
The next level of measures is character features, under
which a text is viewed as a sequence of characters [1].
These measures include counts of alphabetic and digit characters,
counts of uppercase and lowercase characters, letter frequencies,
counts of punctuation marks, and so on [10]. A more sophisticated
approach is to explore the frequencies of character n-grams.
Among the advantages of this approach are the ability to
capture context and the use of punctuation, and tolerance to
grammatical errors.
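A character n-gram profile of the kind mentioned above can be computed in a few lines. The Python sketch below is only an illustration: the function name `char_ngram_frequencies` is hypothetical, and overlapping n-grams over the raw character sequence (including spaces and punctuation) are assumed.

```python
from collections import Counter


def char_ngram_frequencies(text, n):
    """Relative frequencies of overlapping character n-grams in `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    # An input shorter than n yields no n-grams and an empty profile.
    return {g: c / total for g, c in Counter(grams).items()}
```

Since punctuation and spaces are kept, n-grams such as ", a" enter the profile alongside letter sequences, which is how this representation reflects punctuation habits as well as word fragments.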
2015 IEEE International Symposium on Intelligent Control (ISIC)
Part of 2015 IEEE Multi-Conference on Systems and Control
September 21-23, 2015. Sydney, Australia
978-1-4799-7798-7/15/$31.00 ©2015 IEEE 900