I. Bloch and R.M. Cesar, Jr. (Eds.): CIARP 2010, LNCS 6419, pp. 261–268, 2010. © Springer-Verlag Berlin Heidelberg 2010

Text Segmentation by Clustering Cohesion

Raúl Abella Pérez and José Eladio Medina Pagola

Advanced Technologies Application Centre (CENATAV), 7a #21812 e/ 218 y 222, Rpto. Siboney, Playa, C.P. 12200, Ciudad de la Habana, Cuba
{rabella,jmedina}@cenatav.co.cu

Abstract. Automatic linear text segmentation, which aims to detect the best topic boundaries, is a difficult and very useful task in many text processing systems. Several methods have addressed this problem with reasonable results, but they also present some drawbacks. In this work, we propose a new method, called ClustSeg, based on a predefined window and a clustering algorithm to decide the topic cohesion. We compare our proposal against the best-known methods, and it achieves better performance than these algorithms.

1 Introduction

Text segmentation is the task of splitting a document into syntactical units (paragraphs, sentences, words, etc.) or semantic blocks, usually based on topics. The difficulty of text segmentation depends mainly on the characteristics of the documents to be segmented (e.g. scientific texts, news, etc.) and on the segmentation outputs (e.g. topics, paragraphs, sentences, etc.).

There are different approaches to solving this problem. One is linear segmentation, where the document is split into a linear sequence of adjacent segments. Another is hierarchical segmentation, whose algorithms try to identify the document structure, usually chapters and multiple levels of sub-chapters [6].

There are many applications for text segmentation. Many tools for automatic text indexing and information retrieval can be improved by a text segmentation process.
For example, when segmenting transcriptions of broadcast news stories, topic segmentation plays a crucial role, because it can be used to retrieve the passages most closely related to the user's query instead of the full document [9], [11]. In summary generation tasks, text segmentation by topics can be used to select the blocks of text containing the main ideas for the requested summary [9].

Analyzing the performance of different methods of text segmentation by topics [7], [8], [9], [10], we observed some difficulties, such as wrong interruptions of segments, the omission of sentences or paragraphs that belong to a segment, and the generation of segments with incomplete information. When these situations occur, spurious segments are obtained. Another difficulty we observed is that those methods are not able to identify the true relations among the paragraphs of each segment considering natural topic cohesion.

In this work we propose an algorithm for linear multi-paragraph text segmentation based on topics, called ClustSeg, designed to address the aforementioned difficulties. This method is based on a window approach to identify boundaries of