Physica A 387 (2008) 6411–6420
Equilibrium and dynamic methods when comparing an English text and
its Esperanto translation
M. Ausloos
GRAPES, U. Liege, B5 Sart-Tilman, B-4000 Liege, Belgium
article info
Article history:
Received 26 February 2008
Received in revised form 27 May 2008
Available online 23 July 2008
Keywords:
Text
Language
Translation
Zipf
Grassberger–Procaccia
Time series
abstract
Two English texts written by Lewis Carroll, one (Alice in Wonderland) also
translated into Esperanto, the other (Through the Looking Glass), are compared in order to
observe whether natural and artificial languages significantly differ from each other.
One-dimensional time-series-like signals are constructed using only word frequencies (FTS) or
word lengths (LTS). The data is studied through (i) a Zipf method for sorting out correlations
in the FTS and (ii) a method based on the Grassberger–Procaccia (GP) technique for finding
correlations in the LTS. The methods correspond to an equilibrium and a dynamic approach,
respectively, to the features of human texts. There are quantitative statistical differences between
the original English text and its Esperanto translation, but the qualitative differences are
very minute. However, different power laws are observed, with characteristic exponents
for the ranking properties and the phase space attractor dimensionality. The Zipf exponent
can take values much less than unity (∼0.50 or 0.30) depending on how a sentence is
defined. This variety in exponents can be conjectured to be an intrinsic measure of the book
style or purpose, rather than the language or author vocabulary richness, since a similar
exponent is obtained whatever the text. Moreover the attractor dimension r is a simple
function of the so-called phase space dimension n, i.e., $r = n^{\lambda}$, with $\lambda = 0.79$. Such an
exponent could also be conjectured to be a measure of the author's stylistic versatility, here
well preserved in the translation.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
Human written languages are systems usually composed of a large number of internal components (the words,
punctuation signs, and blanks in printed texts) which obey rules (grammar) [1,2]. Relevant questions pertain to the
lifetime, concentration, distribution, ..., complexity of these components, and the relations among them. Thus human language is a
linguistic complexity [3–9].
One should distinguish two main frameworks. On the one hand, language developments seem to be understandable through
competitions, as in Ising models, and in self-organized systems. Their diffusion seems similar to percolation and nucleation-
growth problems, taking into account the existence of different time scales for inter- and intra-effects. The other framework is
somewhat older and originates from more classical linguistics studies; it pertains to content and meaning [1,2]. This
latter case is of interest here and is the main subject of this report, within a statistical physics framework.
Concerning the internal structure of a text, supposedly characterized by the language in which it is written, it is well
known that a text can be mapped into a signal, in the first place through the alphabet characters. However, it can also be
reduced to a smaller set of symbols through some threshold, yielding a time series which can be a list of +1 and −1, or sometimes
0. Many techniques of signal analysis can then be applied.
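As an illustration of such a mapping, the following minimal Python sketch turns a short sentence into a word-length series (LTS), a word-frequency series (FTS), and a thresholded series of +1, −1 and 0. It is only a sketch under stated assumptions: the tokenization rule, the reading of the FTS as "each word replaced by its number of occurrences in the text", the choice of the mean as threshold, and all function names are hypothetical and are not taken from the text.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase words (assumed rule: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def length_series(words):
    """Word-length series (LTS): one value per word, in reading order."""
    return [len(w) for w in words]


def frequency_series(words):
    """Word-frequency series (FTS), assuming each word is replaced by its count in the text."""
    counts = Counter(words)
    return [counts[w] for w in words]


def threshold_series(series, threshold):
    """Map each value to +1 (above), -1 (below) or 0 (equal to) the threshold."""
    return [(x > threshold) - (x < threshold) for x in series]


sample = "Alice was beginning to get very tired of sitting by her sister on the bank"
words = tokenize(sample)
lts = length_series(words)
fts = frequency_series(words)
# The mean word length is used here as an illustrative threshold only.
symbols = threshold_series(lts, sum(lts) / len(lts))
print(lts)
print(fts)
print(symbols)
```

Any of the resulting lists can be treated as a discrete "time" series in which the index is the position of the word in the text; a different threshold or tokenization would give a different symbolic sequence.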
E-mail address: marcel.ausloos@ulg.ac.be.
doi:10.1016/j.physa.2008.07.016