Standardized Specifications, Development and Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European Languages

Dan Tufis, Romanian Academy Center for Artificial Intelligence, Bucharest, Romania [tufis@racai.ro]
Nancy Ide, Department of Computer Science, Vassar College, Poughkeepsie, NY, USA [ide@cs.vassar.edu]
Tomaz Erjavec, Institute Josef Stefan, Ljubljana, Slovenia [tomaz.erjavec@ijs.si]

Abstract

This paper provides an overview of the harmonized language specifications for MULTEXT-East's six CEE languages, which include languages from the Romance, Finno-Ugric, and Slavic language families, and considers their use and distribution in lexicons for these languages. Because these language families include many features and properties not found in western European languages, such as heavy inflection and agglutination, adapting the specifications developed for western European languages posed many interesting and difficult problems. The paper also describes the form and content of the six CEE language lexicons built on the basis of these specifications and provides a quantitative assessment of their content, in order to compare the various languages.

1. Introduction

MULTEXT-East was a project under the European Union Copernicus program whose goal was to develop language resources for six Central and Eastern European (CEE) languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene) and to adapt existing tools and standards to them (Erjavec, Ide, Petkevic, and Véronis, 1996). The project built on and extended the MULTEXT project (Ide and Véronis, 1994), which developed a comprehensive set of corpus-annotation tools, including tools for text segmentation, stochastic part-of-speech tagging, and alignment of parallel texts. MULTEXT-East developed linguistic resources and created a multilingual, partially parallel corpus in the six CEE languages, a portion of which is annotated for part of speech and aligned (Erjavec & Ide, 1998).
Because the overall goal of MULTEXT-East was to develop reusable resources, it was essential to establish standardized methods and specifications for the created resources. To this end, a harmonized set of specifications for lexicon encoding was developed for the six MULTEXT-East languages (Erjavec and Monachini, 1997), based on the specifications developed in the EAGLES project (Bel, Calzolari, and Monachini, 1996) and their extension by the MULTEXT project to six western European languages (English, French, Dutch, Italian, German, Spanish) (Monachini and Calzolari, 1996). Accommodating the different language families represented among the MULTEXT-East languages (Romance, Finno-Ugric, and Slavic) demanded substantial assessment and modification of the pre-existing specifications, because these languages exhibit features that appear rarely in western European languages, such as heavy inflection and agglutination.

To validate the specifications, the MULTEXT-East project built a lexicon for each of its six languages on the basis of the specifications, and used the information in these lexicons for the automatic tagging of a parallel corpus of Orwell's Nineteen Eighty-Four. The availability of a harmonized set of lexical specifications provides a common base for comparing various statistical properties of lexemes in these languages, a comparison that has not previously been possible. This paper provides an overview of the harmonized language specifications for MULTEXT-East's six CEE languages and English, and considers their comparative use and distribution in lexicons and corpora for these languages.

The paper is organized as follows: Section 2 describes the morpho-syntactic specifications used in the lexicons of the project and discusses some encoding decisions. Section 3 describes the form and content of the six lexicons built on the basis of these specifications and provides a quantitative assessment of their content, in order to compare the various languages. Section 4 briefly describes the corpora used in the project and provides information on the distribution of lexical items. Finally, Section 5 summarizes our conclusions and suggests directions for future research.

2. Morpho-Lexical Specifications

The MULTEXT-East lexical specifications describe the grammar of the morpho-syntactic descriptions (MSDs) used in the lexicons of the project. The development of harmonized lexical specifications for the six MULTEXT-East languages began with proposals developed in the EAGLES project (Bel, Calzolari, and Monachini, 1995) and the modifications proposed for six western European languages in the MULTEXT project (Monachini and Calzolari, 1996). These proposals were evaluated for their coverage of the six CEE languages. While adaptation for Romanian, a Romance language, was relatively straightforward, the other CEE languages posed many interesting and difficult problems and demanded substantial assessment and modification of the pre-existing specifications. The nucleus of common features isolated within MULTEXT for western European languages was taken as the common ground for extension to the CEE languages. Specifications for information peculiar to the CEE languages were added as required, taking care that