Physica A 387 (2008) 6411–6420

Equilibrium and dynamic methods when comparing an English text and its Esperanto translation

M. Ausloos
GRAPES, U. Liege, B5 Sart-Tilman, B-4000 Liege, Belgium

Article history: Received 26 February 2008; received in revised form 27 May 2008; available online 23 July 2008

Keywords: Text; Language; Translation; Zipf; Grassberger–Procaccia; Time series

Abstract

Two English texts written by Lewis Carroll, one (Alice in Wonderland) also translated into Esperanto, the other (Through the Looking Glass), are compared in order to observe whether natural and artificial languages significantly differ from each other. One-dimensional time-series-like signals are constructed using only word frequencies (FTS) or word lengths (LTS). The data are studied through (i) a Zipf method for sorting out correlations in the FTS and (ii) a method based on the Grassberger–Procaccia (GP) technique for finding correlations in the LTS. The methods correspond to an equilibrium and a dynamic approach, respectively, to the features of human texts. There are quantitative statistical differences between the original English text and its Esperanto translation, but the qualitative differences are very minute. However, different power laws are observed, with characteristic exponents for the ranking properties and for the phase space attractor dimensionality. The Zipf exponent can take values much less than unity (∼0.50 or 0.30) depending on how a sentence is defined. This variety in exponents can be conjectured to be an intrinsic measure of the book style or purpose, rather than of the language or of the author's vocabulary richness, since a similar exponent is obtained whatever the text. Moreover, the attractor dimension r is a simple function of the so-called phase space dimension n, i.e., r = n^λ, with λ = 0.79.
Such an exponent could also be conjectured to be a measure of the versatility of the author's style, here well preserved in the translation.

© 2008 Elsevier B.V. All rights reserved.

E-mail address: marcel.ausloos@ulg.ac.be.
0378-4371/$ – see front matter. doi:10.1016/j.physa.2008.07.016

1. Introduction

Human written languages are systems usually composed of a large number of internal components (the words, punctuation signs, and blanks in printed texts) which obey rules (grammar) [1,2]. Relevant questions pertain to the lifetime, concentration, distribution, ..., complexity of these components and to their relations with one another. Thus human language is a new emerging field for the application of methods from the physical sciences in order to achieve a deeper understanding of linguistic complexity [3–9].

One should distinguish two main frameworks. On the one hand, language developments seem to be understandable through competitions, as in Ising models, and in self-organized systems. Their diffusion seems similar to percolation and nucleation-growth problems, taking into account the existence of different time scales for inter- and intra-effects. The other framework is somewhat older and originates from more classical linguistics studies; it pertains to content and meaning [1,2]. This latter case is of interest here and is the main subject of this report, within a statistical physics framework.

Concerning the internal structure of a text, supposedly characterized by the language in which it is written, it is well known that a text can be mapped into a signal, of course first through the alphabet characters. However, it can also be reduced to less abundant symbols through some threshold, like a time series, which can be a list of +1 and −1, or sometimes 0. Many techniques of signal analysis can then be applied at this stage.
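As an illustration of how a text can be mapped into such a signal, the following minimal Python sketch builds a word-length series (LTS), a word-frequency series (FTS), and a thresholded series of +1, −1 and 0. It is only a sketch under stated assumptions and not necessarily the procedure used in this work: the tokenization rule, the use of the series mean as the threshold, and the function names `tokenize`, `length_series`, `frequency_series` and `threshold_series` are all hypothetical choices made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase words.

    Assumption: a word is any run of Unicode letters, so punctuation,
    digits and blanks are dropped and "don't" becomes "don" and "t".
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def length_series(words):
    """LTS: the length (number of letters) of each word, in reading order."""
    return [len(w) for w in words]


def frequency_series(words):
    """FTS: for each word in reading order, its count in the whole text."""
    counts = Counter(words)
    return [counts[w] for w in words]


def threshold_series(series):
    """Reduce a series to +1 / -1 / 0 relative to its mean.

    Assumption: the threshold is the arithmetic mean of the series;
    values above it map to +1, below it to -1, and equal to it to 0.
    """
    mean = sum(series) / len(series)
    return [(x > mean) - (x < mean) for x in series]


if __name__ == "__main__":
    sample = "Alice was beginning to get very tired of sitting by her sister"
    words = tokenize(sample)
    print(length_series(words))
    print(frequency_series(words))
    print(threshold_series(length_series(words)))
```

The word position plays the role of time in both series. Because the letter class in `tokenize` is Unicode-aware, the same sketch should also handle the accented letters of Esperanto.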