2011 IEEE International Conference on Fuzzy Systems
June 27-30, 2011, Taipei, Taiwan

978-1-4244-7317-5/11/$26.00 ©2011 IEEE

A Computational Linguistic Approach for the Identification of Translator Stylometry using Arabic-English Text

Heba El-Fiqi
School of Engineering and Information Technology
University of New South Wales, ADFA campus
Canberra, Australia
H.El-Fiqi@student.adfa.edu.au

Eleni Petraki
Faculty of Arts and Design
University of Canberra
Canberra, Australia
Eleni.Petraki@canberra.edu.au

Hussein A. Abbass
School of Engineering and Information Technology
University of New South Wales, ADFA campus
Canberra, Australia
H.Abbass@adfa.edu.au

Abstract- Translator stylometry is a small but growing area of research in computational linguistics. Although research on the wider field of authorship attribution using computational linguistics techniques has proliferated, the translator stylometry problem is more challenging and the literature on it remains sparse. Some authors have even claimed that this problem has no solution, a claim we challenge in this paper. We present an innovative set of translator stylometric features that can be used as signatures to detect and identify translators. The features are based on the concept of network motifs: small local subgraphs that have been used successfully to characterize global network dynamics. The text is transformed into a network in which words become nodes and their adjacencies within a sentence are represented by links. Motifs of size 3 are then extracted from this network, and their distribution is used as a signature for the corresponding translator. We then investigate the impact of sample size, normalization method, and dataset imbalance on classification accuracy. We evaluate several classifiers, including the Fuzzy Lattice Reasoning (FLR) classifier, which achieved the best performance, with classification accuracy reaching the 70% mark.
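The network construction summarized in the abstract can be illustrated with a minimal sketch. It assumes directed links from each word to the word that follows it in the same sentence, a simple regular-expression tokenizer, and brute-force enumeration of node triples; none of these details is specified above, and all function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations, permutations


def build_adjacency_network(text):
    """Return directed edges (w1, w2) where w2 directly follows w1 in a sentence.

    Assumption: sentences end at '.', '!' or '?', and words are runs of letters.
    """
    edges = set()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for first, second in zip(words, words[1:]):
            if first != second:  # ignore self-loops
                edges.add((first, second))
    return edges


def canonical_form(triple, edges):
    """Relabelling-invariant code for the subgraph induced by three nodes."""
    return max(
        tuple(int((p[i], p[j]) in edges)
              for i in range(3) for j in range(3) if i != j)
        for p in permutations(triple)
    )


def is_connected(code):
    """True if the three nodes are weakly connected.

    Code positions are ordered (0,1), (0,2), (1,0), (1,2), (2,0), (2,1).
    """
    linked_pairs = [
        code[0] or code[2],  # nodes 0 and 1
        code[1] or code[4],  # nodes 0 and 2
        code[3] or code[5],  # nodes 1 and 2
    ]
    return sum(bool(x) for x in linked_pairs) >= 2


def motif_distribution(edges):
    """Relative frequency of each connected size-3 subgraph type."""
    nodes = sorted({word for edge in edges for word in edge})
    counts = Counter()
    for triple in combinations(nodes, 3):  # brute force: fine for small texts
        code = canonical_form(triple, edges)
        if is_connected(code):
            counts[code] += 1
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


if __name__ == "__main__":
    sample = "The cat sat. The dog sat."
    for code, share in motif_distribution(build_adjacency_network(sample)).items():
        print(code, share)
```

The resulting dictionary is one possible form of the motif-distribution signature; enumerating every node triple is cubic in vocabulary size, so a realistic corpus would need a neighbor-based enumeration instead.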
Keywords- Translator Stylometry; Authorship Attribution; Network Motifs; Decision Tree Analysis; Fuzzy Classifier; Computational Linguistics; Arabic-English Corpus.

I. INTRODUCTION

Identifying the author of a text is an important area of research in “Computational Linguistics” [1, 2]. There are many studies on “Authorship Attribution” [1-14]. These studies use stylometric features to identify the original authors of texts. The features include lexical, character, syntactic, semantic, and application-specific features [1-14]. On the one hand, authorship attribution is a difficult problem, and research challenges remain in this area. On the other hand, the sub-problem of translator attribution is even harder, and no solution for it exists so far.

Translation is a fascinating topic. While the original writer of a text has a specific mental picture in mind and an intended message to communicate through the text, a translator faces a different type of challenge. Successful translation requires the translator to form the same mental picture as the original author of the text. Good translation does not stop at the level of mapping words, but extends to mapping meaning, mental pictures, imagination, and feelings. This is called the “loyalty” dilemma, and there is wide discussion in the literature on the importance of maintaining the spirit of the original work.

If we compare author attribution to translator attribution, we find that the former is expected to offer more signatures, or discriminatory factors, representing the choices made by the authors. Authors have many more degrees of freedom with which to build their own identity. Translators have fewer. Being constrained by the original text is a non-trivial limitation. This feature alone makes translator attribution a more difficult problem than author attribution.
Nevertheless, we conjecture that translators attempt to leave their own touch: signatures that can be used to detect who translated what. This is the hypothesis we hold in this paper.

The problem of how to identify translator stylometry is under-studied in the literature, probably because it is a harder problem. Some argue that a translated work is the original author’s literary work rather than the translator’s own; however, no one can ignore the fact that translators are individuals [15] who make personal choices that can affect the translation process. This is what we call “Translator Stylometry”.

In this paper, we introduce a new method that uses network motifs to identify differences in translators’ styles. Our hypothesis is reformulated as “network motifs can be used to differentiate between different translators based on their own writing stylometrics”. To test this hypothesis, we represent our datasets as networks by generating a word adjacency network for each piece of work by a translator in the dataset.

To analyze and compare two networks, we can use either their global statistical features, such as shortest-path length, global centrality, and clustering coefficient, or their structural design principles, such as network motifs. Network motifs, which were first introduced by Milo et al. [16], are patterns