Standardized Specifications, Development and Assessment of Large
Morpho-Lexical Resources for Six Central and Eastern European
Languages
Dan Tufis, Romanian Academy, Center for Artificial Intelligence, Bucharest, Romania [tufis@racai.ro]
Nancy Ide, Department of Computer Science, Vassar College, Poughkeepsie, NY, USA [ide@cs.vassar.edu]
Tomaz Erjavec, Institute Josef Stefan, Ljubljana, Slovenia [tomaz.erjavec@ijs.si]
Abstract
This paper provides an overview of the harmonized language
specifications for MULTEXT-East's six CEE languages, which
include languages from the Romance, Finno-Ugric, and Slavic
language families, and considers their use and distribution in
lexicons for these languages. Because these language families
include many features and properties not found in western
European languages, such as heavy inflection and
agglutination, adapting the specifications for western
European languages to these languages posed many
interesting and difficult problems. The paper also describes the form and
content of the six CEE language lexicons built on the basis of
these specifications and provides quantitative assessment of
the content, in order to compare the various languages.
1. Introduction
MULTEXT-East was a project under the European Union
Copernicus program whose goal was to develop language
resources for six Central and Eastern European (CEE)
languages (Bulgarian, Czech, Estonian, Hungarian,
Romanian, Slovene) and to adapt existing tools and
standards to them (Erjavec, Ide, Petkevic, Véronis, 1996).
The project built on and extended the MULTEXT project
(Ide and Véronis, 1994), which developed a comprehensive
set of corpus-annotation tools, including tools for text
segmentation, stochastic part of speech tagging, and
alignment of parallel texts. MULTEXT-East developed
linguistic resources and created a multi-lingual, partially
parallel corpus in the six CEE languages, a portion of
which is annotated for part of speech and aligned (Erjavec
& Ide, 1998).
Because the overall goal of MULTEXT-East was to
develop reusable resources, it was essential to establish
standardized methods and specifications for the created
resources. To this end, a harmonized set of specifications
for lexicon encoding was developed for the six
MULTEXT-East languages (Erjavec and Monachini,
1997), based on the specifications developed in the
EAGLES project (Bel, Calzolari, and Monachini, 1996)
and their extension by the MULTEXT project to six
western European languages (English, French, Dutch,
Italian, German, Spanish) (Monachini and Calzolari,
1996). The language families represented among the
MULTEXT-East languages (Romance, Finno-Ugric, and
Slavic) demanded substantial assessment and modification
of the pre-existing specifications, because they include
features that appear rarely in western European languages,
such as heavy inflection and agglutination. To validate the
specifications, the MULTEXT-East project built a lexicon
for each of its six languages on their basis and used the
information in these lexicons for the automatic tagging
of a parallel corpus of Orwell's Nineteen Eighty-Four.
The availability of a harmonized set of lexical
specifications provides a common base for comparison of
various statistical properties of lexemes in these
languages, which has heretofore been impossible. This
paper provides an overview of the harmonized language
specifications for MULTEXT-East's six CEE languages
and English, and considers their comparative use and
distribution in lexicons and corpora for these languages.
The paper is organized as follows: Section 2 provides a
description of the morpho-syntactic specifications used in
the lexicons of the project and discusses some encoding
decisions. Section 3 describes the form and content of the
six lexicons built on the basis of these specifications and
provides quantitative assessment of the content, in order
to compare the various languages. Section 4 briefly
describes the corpora used in the project and provides
information on the distribution of lexical items. Finally,
Section 5 summarizes our conclusions and suggests
directions for future research.
2. Morpho-Lexical Specifications
The MULTEXT-East lexical specifications describe the
grammar of the morpho-syntactic descriptions (MSDs)
used in the lexicons of the project. The development of
harmonized lexical specifications for the six MULTEXT-
East languages began with proposals developed in the
EAGLES project (Bel, Calzolari, and Monachini, 1995)
and the modifications proposed for six western European
languages in the MULTEXT project (Monachini and
Calzolari, 1996). These proposals were evaluated from the
point of view of coverage for the six CEE languages.
While adaptation for Romanian, a Romance language, was
relatively straightforward, the other CEE languages posed
many interesting and difficult problems and demanded
substantial assessment and modification of the pre-
existing specifications.
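To make the notion of an MSD concrete, the following minimal Python sketch decodes one, assuming a positional encoding in which the first character names the category, each following character gives the value of one attribute, and a hyphen marks an attribute that does not apply. The attribute table is a simplified, hypothetical excerpt for nouns only, and the name decode_msd is introduced here for illustration; neither is taken from the project's specifications or tools.

```python
# Simplified, hypothetical attribute table for nouns (an assumption for
# illustration; the actual specifications define more categories and values).
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
]

# Maps the category character to a category name and its ordered attributes.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into a dict of attribute-value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category in MSD: {msd!r}")
    category, attributes = CATEGORIES[msd[0]]
    codes = msd[1:]
    if len(codes) > len(attributes):
        raise ValueError(f"too many positions in MSD: {msd!r}")
    features = {"Category": category}
    # Trailing attributes may be omitted; zip stops at the shorter sequence.
    for (name, values), code in zip(attributes, codes):
        if code == "-":
            continue  # attribute not applicable for this word form
        if code not in values:
            raise ValueError(f"invalid value {code!r} for {name}")
        features[name] = values[code]
    return features


if __name__ == "__main__":
    print(decode_msd("Ncmsn"))
```

Under these assumptions, the string Ncmsn expands to a common masculine singular nominative noun, while Nc-s leaves Gender unspecified. Keeping the attribute tables as data rather than code means that a language-specific extension only adds entries and leaves the decoding logic unchanged.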
The nucleus of common features isolated within
MULTEXT for western European languages was taken
as the common ground for extension to the CEE
languages. Specifications for information peculiar to the
CEE languages were added as required, taking care that