SegGen: a Genetic Algorithm for Linear Text Segmentation

S. Lamprier, T. Amghar, B. Levrat and F. Saubion
LERIA, Université d'Angers
2, Bd Lavoisier 49045 Angers (France)
{lamprier,amghar,levrat,saubion}@info.univ-angers.fr

Abstract

This paper describes SegGen, a new algorithm for linear text segmentation on general corpora. It aims to segment texts into thematically homogeneous parts. Several existing methods have been used for this purpose, based on a sequential creation of boundaries. Here, we propose to consider boundaries simultaneously by means of a genetic algorithm. SegGen uses two criteria: maximization of the internal cohesion of the formed segments and minimization of the similarity of adjacent segments. First experimental results are promising and SegGen appears to be very competitive compared with existing methods.

1 Introduction

The purpose of automatic text segmentation is to identify the most important thematic breaks in a document in order to cut it into homogeneous units, disconnected from other adjacent parts [Salton et al., 1996]. More precisely, segmentation partitions a text by determining boundaries between contiguous segments related to different topics, thereby defining semantically coherent parts of text that are sufficiently large to expose some aspect of a given subject. Thematic segmentation of texts can also be seen as a process of grouping basic units (words, sentences, paragraphs...) in order to highlight local semantic coherence [Kozima, 1993]. The granularity level of the segmentation depends on the size of the units.

The increasing interest in text segmentation is mainly explained by the number of its applications, such as text alignment, document summarization, information extraction, or information retrieval [Baeza-Yates and Ribeiro-Neto, 1999]. Text segmentation can indeed be very useful for these tasks by providing the structure of a document in terms of the different topics it covers [McDonald and Chen, 2002].
Many segmentation methods have been proposed, and we focus here on the most general and significant methods that rely on statistical approaches, such as TextTiling [Hearst, 1997], C99 [Choi, 2000], DotPlotting [Reynar, 2000] or Segmenter [Kan et al., 1998]. These methods analyze the distribution of the words in the text in order to determine the thematic changes by means of lexical inventory variations in fixed-size windows, and thus create boundaries in the text where the local cohesion is the lowest.

In this paper, we introduce SegGen, a genetic algorithm that achieves a statistical linear segmentation of texts. Section 2 presents the main motivations of our work. Our segmentation algorithm is described in section 3. Then, section 4 describes an experimental study of the algorithm in order to tune its parameters. Finally, section 5 evaluates SegGen by comparing it with other segmentation systems.

2 Motivations and Preliminary Works

In most existing segmentation approaches, the relationships considered between sentences are usually very local. For example, the methods addressing segmentation by means of lexical chains¹ mainly use the repetitions of the terms in order to define thematic boundaries. More precisely, these methods cut the texts where the number of lexical chains is minimal. Nevertheless, neither the context of these multiple occurrences nor the significance of simultaneous lexical chains at a given position in the text is addressed.

Following this remark, we attempted to propose an alternative approach to the segmentation of texts that takes a more complete view of the texts into account. In [Bellot, 2000], P. Bellot has shown that there exists a strong relationship between clustering and text segmentation. His hypothesis states that it is possible to set a boundary between two adjacent sentences belonging to two different semantic classes.
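As a rough illustration of this hypothesis (a sketch only, not Bellot's implementation), assume that some clustering method has already assigned a class label to every sentence. The function name `boundaries_from_clusters` and the label values below are hypothetical and chosen purely for the example.

```python
from typing import List, Sequence


def boundaries_from_clusters(labels: Sequence[int]) -> List[int]:
    """Return every gap index i such that sentences i and i+1
    carry different cluster labels (i.e. a candidate boundary)."""
    return [i for i in range(len(labels) - 1) if labels[i] != labels[i + 1]]


# Hypothetical cluster labels for seven consecutive sentences.
labels = [0, 0, 0, 1, 1, 1, 1]
print(boundaries_from_clusters(labels))  # [2]
```

Here a single boundary is placed after the third sentence (gap index 2), where the class label changes.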
This clustering-based assumption seems too strong, since text segmentation depends not only on the similarity of the sentences but also on their layout in the text. Indeed, discourse structures of documents may be very diverse, and a part of a text related to a particular topic may contain sentences deviating somewhat from it. These sentences are likely to be classified differently from their neighbors. However, this certainly does not imply that a boundary necessarily has to be set there. Moreover, this assumption is too dependent on the chosen clustering mechanism, since no existing clustering method is fully reliable.

¹ Lexical chains are formed between two occurrences of the same term [Galley et al., 2003] [Utiyama and Isahara, 2001] and occasionally between synonyms and terms having statistical associations such as generalization/specialization or part-whole/whole-part relationships [Morris and Hirst, 1991].

IJCAI-07 1647