I. Bloch and R.M. Cesar, Jr. (Eds.): CIARP 2010, LNCS 6419, pp. 261–268, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Text Segmentation by Clustering Cohesion
Raúl Abella Pérez and José Eladio Medina Pagola
Advanced Technologies Application Centre (CENATAV), 7a #21812 e/ 218 y 222, Rpto.
Siboney, Playa, C.P. 12200, Ciudad de la Habana, Cuba
{rabella,jmedina}@cenatav.co.cu
Abstract. Automatic linear text segmentation, which aims to detect the best topic
boundaries, is a difficult but very useful task in many text processing systems.
Some methods have addressed this problem with reasonable results, but they
also present some drawbacks. In this work, we propose a new method, called
ClustSeg, based on a predefined window and a clustering algorithm to decide
the topic cohesion. We compare our proposal against the best known methods
and obtain better performance than these algorithms.
1 Introduction
Text segmentation is the task of splitting a document into syntactical units
(paragraphs, sentences, words, etc.) or semantic blocks, usually based on topics. The
difficulty of text segmentation depends mainly on the characteristics of the documents
to be segmented (e.g. scientific texts, news, etc.) and on the segmentation outputs
(e.g. topics, paragraphs, sentences, etc.). There are different approaches to solving this
problem. One is linear segmentation, where the document is split into a linear
sequence of adjacent segments. Another is hierarchical segmentation, whose
algorithms try to identify the document structure, usually chapters
and multiple levels of sub-chapters [6].
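To make the notion of a linear segmentation concrete, the following minimal sketch represents a document as a list of paragraphs and a segmentation as a list of boundary indices. The function name `split_linear` and the convention that each boundary index marks the first paragraph of a new segment are assumptions made only for this illustration; they are not part of ClustSeg or of any method cited here.

```python
from typing import List, Sequence


def split_linear(paragraphs: Sequence[str], boundaries: Sequence[int]) -> List[List[str]]:
    """Split a document into a linear sequence of adjacent segments.

    `boundaries` lists the paragraph indices at which a new segment starts
    (illustrative convention; index 0 is implicit). Out-of-range or
    duplicate indices are ignored.
    """
    if not paragraphs:
        return []
    n = len(paragraphs)
    # Keep only valid interior cut points, then add both document ends.
    cuts = [0] + sorted({b for b in boundaries if 0 < b < n}) + [n]
    # Consecutive cut points delimit adjacent, non-overlapping segments.
    return [list(paragraphs[start:end]) for start, end in zip(cuts, cuts[1:])]


paragraphs = ["p0", "p1", "p2", "p3", "p4"]
segments = split_linear(paragraphs, boundaries=[2, 4])
for segment in segments:
    print(segment)
```

Under this representation every paragraph belongs to exactly one segment and the segments, concatenated in order, give back the original document. A hierarchical segmentation would instead nest such segments inside one another.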
There are many applications for text segmentation. Many tools for automatic text
indexing and information retrieval can be improved by a text segmentation process.
For example, when segmenting transcriptions of broadcast news stories, topic
segmentation plays a crucial role, because it can be used to retrieve
the passages most closely related to the user's query, instead of the full
document [9], [11]. In summary generation, text segmentation by topics can be
used to select the blocks of text containing the main ideas for the requested summary [9].
Analyzing the performance of different methods of text segmentation by topics [7],
[8], [9], [10], we observed some difficulties, such as wrong interruptions of
segments, leaving out sentences or paragraphs which belong to the segments, and
generating segments with incomplete information. When these situations occur,
spurious segments are obtained. Another difficulty we observed is that those methods
are not able to identify the true relations among the paragraphs of each segment
considering natural topic cohesion.
In this work we propose an algorithm, called ClustSeg, for linear topic-based
segmentation of texts into multi-paragraph segments, designed as a solution to the aforementioned
difficulties. This method is based on a window approach to identify boundaries of