Automatic Structuring of Written Texts

Marek Veber, Aleš Horák, Rostislav Julinek, Pavel Smrž

Faculty of Informatics, Masaryk University
Botanická 68a, 60200 Brno, Czech Republic ⋆⋆

Abstract. This paper deals with automatic structuring and sentence boundary labelling in natural language texts. We describe the implemented structure tagging algorithm and the heuristic rules that are used for automatic or semiautomatic labelling. Within each detected sentence, the algorithm decomposes the text into clauses, and it then marks the parts of the text that do not form a sentence, i.e. headings, signatures, tables and other structured data. We also pay attention to the processing of matched symbols in the text, especially to the analysis of direct speech notation.

1 Introduction

In order to reduce the time and memory demands of syntactic analysis, POS tagging, alignment of parallel corpora and other NLP tasks, one first needs to divide the analysed text into parts which are then analysed separately. The first suitable division points are paragraph boundaries. After appropriate pre-analysis it is possible to go even deeper and segment the text into sentences and then into individual clauses. The analysis is facilitated by the demarcation of those word groups that cannot or should not be divided any further, like data, personal names, URL addresses etc.

In sentence boundary labelling we meet the problem of the ambiguity of the full stop (a dot). It can denote a sentence end, it can be part of an abbreviation, or it can bear both of these meanings at once. According to statistical results for English [3], 90% of full stops mark a sentence end, 9.5% an abbreviation and 0.5% both; in the Czech corpus DESAM we have 92% sentence end, 5.5% abbreviation and 2.5% both meanings.
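The three readings of the full stop can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the algorithm described in this paper: the abbreviation lists, the function name `classify_full_stops` and the capitalisation clue are all hypothetical choices made for the example, and a real system would need far larger lists and more context.

```python
# Hypothetical abbreviation lists, chosen only for this example.
ABBREVIATIONS = {"dr", "prof", "etc", "e.g", "i.e", "p.m", "a.m"}
# Assumption: title abbreviations are followed by a name, so they
# are never treated as sentence-final here.
TITLES = {"dr", "prof"}


def classify_full_stops(text):
    """Label every token-final full stop in `text`.

    Returns a list of (token, label) pairs, where label is one of
    'sentence_end', 'abbreviation' or 'both'.
    """
    tokens = text.split()
    labels = []
    for i, token in enumerate(tokens):
        if not token.endswith("."):
            continue
        word = token[:-1].lower()
        if word not in ABBREVIATIONS:
            # Not a known abbreviation: assume the dot ends a sentence.
            labels.append((token, "sentence_end"))
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        # Capitalisation clue: end of text or a capitalised successor
        # suggests that the abbreviation also closes the sentence.
        closes_sentence = following is None or following[0].isupper()
        if closes_sentence and word not in TITLES:
            labels.append((token, "both"))
        else:
            labels.append((token, "abbreviation"))
    return labels


if __name__ == "__main__":
    sample = "We met Dr. Novak at 5 p.m. He brought apples, pears etc. and left."
    for token, label in classify_full_stops(sample):
        print(token, label)
```

Even on this short sample all three readings occur, and the special treatment of titles shows how quickly a purely list-based heuristic needs exceptions.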
Common approaches to the problem of labelling these hierarchical structures use regular expressions or finite automata with look-ahead that rely on several simple clues in the text (like capitalisation), together with a list of abbreviations and exceptions (see e.g. [1]). Other approaches are based on regression trees [2] and artificial neural networks [3], which make use of contextual information about the POS tags in the surroundings of the potential structure boundary. However, those approaches cannot be easily applied to the analysis of Czech because of the size of the Czech tagset [4, 5].

⋆⋆ The research is sponsored by the Czech Ministry of Education under the grant VS 97028.