Principled Hidden Tagset Design for Tiered Tagging of Hungarian

Dan Tufiş, Péter Dienes, Csaba Oravecz, Tamás Váradi

Romanian Academy (RACAI)
13, ’13 septembrie’, 74311, Bucharest 5, Romania
tufis@valhalla.racai.ro

Research Institute for Linguistics
Hungarian Academy of Sciences, Budapest
{dienes,oravecz,varadi}@nytud.hu

Abstract

For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise quite rapidly, approaching or even exceeding a thousand distinct codes. For the purpose of automatic disambiguation of arbitrary written texts, using such large tagsets would raise many problems, ranging from the implementation issues of making a tagger work with such a large tagset to the more theory-based difficulty of sparseness of training data. Tiered tagging is one way to alleviate this problem by reformulating it in the following way: starting from a large set of MSDs, design a reduced tagset, the Ctag-set, that is manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the MSD-set cardinality is several thousand. This means that designing a manageable Ctag-set calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features.

1. Introduction

The combinatorial possibilities of inflection and derivation in Hungarian morphology (for an estimate see (Tihanyi, 1996)) pose a challenge for corpus annotation in that it is difficult to establish a set of morphosyntactic descriptions that does justice to the rich morpho-syntactic information encoded within the words and at the same time remains computationally tractable. Tiered tagging (Tufiş, 1998) is one way to alleviate this problem by reformulating it in the following way: starting from a large set of MSDs, design a reduced tagset, the Ctag-set, that is manageable for current tagging technology.
The Ctag-set is used as a hidden tagset for the proper tagging of a text. This text, tagged in terms of the Ctag-set, is then subject to a procedure that aims to recover all (or most) of the information left out of the Ctag-set with respect to the MSD-set. In other words, each Ctag assigned to an item in the tagged text is replaced with an appropriate and more informative descriptor, namely an MSD.

In Section 2 we give an overview of the general principles one can follow in the design process. Section 3 presents the data analysis, mostly along the lines described in (Váradi and Oravecz, 1999), but with much larger data sets and further investigations than those presented there. Section 4 describes the process of reducing the MSD-set to a Ctag-set of manageable size. In Section 5 we show some preliminary results on tagging accuracy and error analysis, comparing the performance of tagging with a verbose tagset and that of tiered tagging with a more constrained tagset. Conclusions and suggestions for further work follow in Section 6.

The author was supported by the Research Support Scheme of the Open Society Support Foundation, grant No.: 320/1998

2. General requirements for tiered tagging

The design process of a reduced tagset has to consider two fundamental requirements: to identify and leave out the features/values in the MSDs which do not provide relevant clues for contextual disambiguation, and to make it possible to recover, as accurately and as fast as possible, the information eliminated in the previous phase.

Fortunately, these two objectives, although not very simple to reach, are feasible and rewarding. The process is a trial-and-error one and relies both on human introspection and on evidence provided by the data analysis. One possible approach would be to use an information-lossless algorithm to convert the MSD-set into a Ctag-set.
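As an illustration only, the following minimal sketch shows one way such a lossless conversion could be checked. It assumes that a conversion counts as information-lossless when, for every word form in the lexicon, distinct MSDs still map to distinct Ctags, so that the original MSD can be recovered from the pair (word form, Ctag). This reading, the toy lexicon, the positional MSD strings, and all function names are hypothetical and are not taken from the authors' implementation.

```python
# Hypothetical toy lexicon: word form -> set of MSDs.
# MSDs are written as positional feature strings (one character per feature).
LEXICON = {
    "w1": {"Ncsn", "Vmis"},  # ambiguous between a noun and a verb reading
    "w2": {"Ncsn", "Ncpn"},  # readings differ only at position 2
    "w3": {"Ncsa"},          # unambiguous
}


def to_ctag(msd, dropped):
    """Collapse an MSD into a Ctag by blanking the feature positions in `dropped`."""
    return "".join("-" if i in dropped else ch for i, ch in enumerate(msd))


def is_lossless(lexicon, dropped):
    """Return True if no word form has two MSDs that collapse into the same Ctag."""
    for msds in lexicon.values():
        ctags = {to_ctag(m, dropped) for m in msds}
        if len(ctags) < len(msds):
            return False
    return True


def recover(lexicon, word, ctag, dropped):
    """Return the MSDs of `word` that are subsumed by `ctag`."""
    return {m for m in lexicon[word] if to_ctag(m, dropped) == ctag}


def tagset_size(lexicon, dropped):
    """Count the distinct Ctags obtained from all MSDs in the lexicon."""
    return len({to_ctag(m, dropped) for msds in lexicon.values() for m in msds})


if __name__ == "__main__":
    print(is_lossless(LEXICON, {3}))            # dropping position 3 keeps readings apart
    print(is_lossless(LEXICON, {2}))            # dropping position 2 merges the readings of "w2"
    print(tagset_size(LEXICON, set()), tagset_size(LEXICON, {3}))
    print(recover(LEXICON, "w2", "Ncs-", {3}))
```

In this toy setting, dropping position 3 shrinks the tagset while every word form keeps a unique MSD per Ctag, whereas dropping position 2 introduces an ambiguity for "w2" that a later recovery step would have to resolve.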
Such a lossless algorithm might reduce the size of the tagset by 10-20%, which is too little for a large initial tagset. However, modifying the algorithm to allow for limited ambiguity (that is, losing a limited amount of information) could result in a drastic reduction of the Ctag-set, down to a cardinality that is within the restrictions imposed by the available training data and computing power.

The remaining problem is deciding what kinds of ambiguity to accept in the output of such a generalization algorithm, so that a subsequent process will be able to resolve them. In our approach, the reduced tagset is designed to subsume the MSD-set. As such, once a Ctag has been assigned to a lexical item in the tagged text, the recovery process has to identify the relevant MSD out of the set of MSDs that are subsumed by the Ctag in question.

The recovery process could be lexicon driven (the lexicon would be encoded in terms of the large MSD-set) and can be conceived of as the intersection between the set of MSDs subsumed by the Ctag assigned to a word form w and the set of MSDs for w as provided by the lexicon (Tufiş, 2000). This model can be compiled as a database, so that the recovery process could be a simple look-up in this