Statistics of Morphological Finite-State Transition Networks Obey the Power Law

Alexander Troussov, Brian O’Donovan
IBM Dublin Software Lab, Airways Ind. Est., Cloghran, Dublin 17, Ireland
{atrousso, Brian_ODonovan}@ie.ibm.com

Abstract

Finite-state devices are widely used in natural language processing, yet little if anything is known about the metrics and topology of finite-state transition graphs. Here we study numerically the structure of directed state transition graphs for several types of finite-state devices representing the morphology of 16 languages. In all experiments we have found that the distribution of incoming and outgoing links is highly skewed and is modeled well by the power law, not by the Poisson distribution typical of classical random graphs. The power-law form of degree distribution is regarded as a signature of self-organizing systems, and it has previously been found for numerous real-world networks in communication, biology, social sciences and economics.

1 Introduction

Finite-state devices, including finite-state automata and transducers, are widely used in natural language processing to produce morphological information. Constructed as applications of formal finite-state techniques, they can be considered as networks where nodes represent states and arcs (labeled by characters) represent the transitions. Examination of their graph metrics and topology is essential for efficient computer implementation of finite-state processing, including per-node optimization. It might also lead to new quantitative methods in language typology, as we argue below.

In computational linguistics, semantic and co-occurrence networks have already been studied. In these networks nodes correspond to words. In semantic networks the links show semantic relations between words. In co-occurrence networks links represent the fact that words occur beside each other in a corpus.
We are not aware of similar investigations applied to finite-state transition networks representing language morphology. In [Leslie 1995] the average out-degree of (random, non-deterministic) automata is shown to be a good predictor of the expected number of states in the determinized automaton; the same technique is used in [van Noord 2000].

In the Introduction we review the basics of finite-state processing in morphological applications and explain why applying modern random network theory might be of interest for applications in finite-state processing.

In the second section, Random Networks and Related Work, we briefly outline the methods and results of this relatively new theory to identify which of them are related to the study of finite-state devices. We argue that one particular metric studied for random networks, the degree distribution, is of special interest for the initial investigation.

In the third section we describe the morphological data used in our experiments, and in the fourth our cross-linguistic experimental study of the degree distribution, which we have found to be well approximated by the power law.

In the Discussion we put forward additional considerations about the consequences of power-law behavior in view of our experiments.

1.1 Finite-state devices used in morphology

In our experiments we analyzed two major types of finite-state devices used in natural language processing for word verification and for producing morphological information. In both devices word verification is regarded as a process of moving from an initial input state to an acceptance state in a space of character transitions.

Finite-state automata. The input list of words (surface forms) is compiled into a letter tree, which is then minimized to reuse common postfixes. Each word can be loaded with additional information (its part-of-speech categories, etc.), which can be attached to the leaves (the terminals) of the letter tree.
In this case two postfixes can be merged only if they lead to exactly the same information. Finite-state automata (FSAs) constructed this way are acyclic and deterministic (for each state and each character there can be at most one outgoing link labeled by this character).

Lexical transducers. In our experiments we also analyzed IBM lexical transducers that implement two-level morphology rules. Some of them have cycles and are non-deterministic.
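
The construction of acyclic finite-state automata described above (a letter tree that is then minimized by merging common postfixes) can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the function names, the dictionary-based node layout, and the part-of-speech tags in the toy word list are all assumptions made for illustration. The sketch builds the letter tree, merges subtrees that accept the same postfixes with the same attached information, and then counts incoming and outgoing links per state, which is the raw data behind a degree distribution.

```python
from collections import Counter

def build_trie(entries):
    """Compile {word: info} into a letter tree.
    Each node is {'info': payload or None, 'edges': {char: node}}."""
    root = {"info": None, "edges": {}}
    for word, info in entries.items():
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"info": None, "edges": {}})
        node["info"] = info  # information attached to the terminal
    return root

def minimize(node, registry):
    """Bottom-up merging of equivalent subtrees.
    Two nodes are merged only if they carry the same info and have the
    same labeled transitions to the same (already merged) states.
    Returns the canonical state id; registry maps signature -> id."""
    children = tuple(sorted(
        (ch, minimize(child, registry)) for ch, child in node["edges"].items()
    ))
    signature = (node["info"], children)
    if signature not in registry:
        registry[signature] = len(registry)
    return registry[signature]

def degree_counts(registry):
    """Out-degree and in-degree of every state of the minimized automaton."""
    out_deg = {}
    in_deg = Counter()
    for (_info, children), state in registry.items():
        out_deg[state] = len(children)
        for _ch, target in children:
            in_deg[target] += 1
    return out_deg, in_deg

# Toy word list with hypothetical tags (illustration only).
entries = {
    "walked": "V-past",
    "talked": "V-past",
    "walks": "V-3sg",
    "talks": "V-3sg",
}

root = build_trie(entries)
registry = {}
root_id = minimize(root, registry)
out_deg, in_deg = degree_counts(registry)

print("states:", len(registry))
print("out-degree histogram:", sorted(Counter(out_deg.values()).items()))
print("in-degree histogram:", sorted(Counter(in_deg[s] for s in out_deg).items()))
```

In this toy example the two branches for "walk" and "talk" collapse into one chain of states because their postfixes and tags coincide, while the terminals for "-ed" and "-s" stay distinct because they carry different information. If every word carried the same tag, the two terminals would also merge.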