Named Entities in Czech: Annotating Data and Developing NE Tagger ⋆

Magda Ševčíková, Zdeněk Žabokrtský, and Oldřich Krůza

Faculty of Mathematics and Physics, Charles University
Malostranské nám. 25, CZ-11800 Prague, Czech Republic
{sevcikova,zabokrtsky,kruza}@ufal.mff.cuni.cz

Abstract. This paper deals with the treatment of Named Entities (NEs) in Czech. We introduce a two-level NE classification. We have used this classification for manual annotation of two thousand sentences, yielding more than 11,000 NE instances. Employing the annotated data and Machine-Learning techniques (namely the top-down induction of decision trees), we have developed and evaluated a software system aimed at automatic detection and classification of NEs in Czech texts.

1 Introduction

After the series of Message Understanding Conferences (MUC; [1]), the processing of NEs became a well-established discipline within the NLP domain (see [2] for a survey of NE-related research), usually motivated by the needs of Information Extraction, Question Answering, or Machine Translation. For English, one can find literature about attempts at rule-based solutions for the NE task as well as machine-learning approaches, be they dependent on the existence of labeled data (such as the CoNLL-2003 shared task data), unsupervised (using redundancy in NE expressions and their contexts, see e.g. [3]), or a combination of both (such as [4], in which labeled data are used as a source of seeds for an unsupervised procedure exploiting huge unlabeled data).

For Czech, the situation is different. To the best of our knowledge, until the presented work there have been no data with explicitly annotated NE instances available for Czech. Although there are several other types of available resources potentially usable for recognition and classification of NEs (e.g.
gazetteers, or technical lemma suffixes used at the morphological layer of PDT [5]), we have not found any published attempt to develop an NE tagger for Czech.¹

This paper is structured as follows: in Section 2 we introduce our classification of NEs, which we have used for annotating sample sentences as described in

⋆ The research reported on in this paper was supported by the projects 1ET101120503, MSM0021620838, MSMT CR LC536, GD201/05/H014, and GA UK 643/2007.

¹ Even if some approaches developed for English are claimed to be language-independent, it is obvious that they cannot be straightforwardly applied to Czech because of its rich inflection.