An Automatic Linguistics Approach for Persian Document Summarization
Hossein Kamyar, Mohsen Kahani, Mohsen Kamyar, Asef Poormasoomi
Web Technology Lab, Ferdowsi University of Mashhad
Mashhad, Iran
Hossein.kamyar@stu-mail.um.ac.ir, kahani@um.ac.ir, mkamyar@stu-mail.um.ac.ir, As.poormasoomi@stu-mail.um.ac.ir
Abstract- In this paper we propose a novel technique for
summarizing a text based on the linguistic properties of text
elements and the semantic chains among them. Most
summarization approaches rely chiefly on the
statistical properties of text elements, such as term frequency.
Here we use centering theory, which helps us recognize
semantic chains in a text, to build a new automatic
single-document summarization approach. Processing a
text with centering theory and extracting a coherent summary
requires a processing pipeline. This pipeline
consists of several components, such as co-reference
resolution, semantic role labeling, and part-of-speech (POS)
tagging.
Keywords- Single-document summarization, Centering
Theory, LSI, Extractive, Persian
I. INTRODUCTION
Automatic document summarization is an important
tool in an age of explosive data growth. According to
[1], a summary is a text generated from one or more
texts that conveys the important concepts of those texts
and is no longer than half of the
source texts. This simple definition captures the main
properties of a summary: (1) it is derived from one or more
texts, (2) it carries the major information of the source texts, and (3)
it is short.
Single-document summarization is concerned with extracting
the important and salient knowledge from a text
[2]. Research in this field can be
categorized into extractive and abstractive summarization.
An extractive summary returns some of the source sentences as
the important sections, whereas an abstractive summary
represents the internal knowledge of a text using
possibly different wording [2].
In this work, we propose an extractive single-document
summarization approach that combines
a linguistic theory (Centering Theory) with some
statistical parameters of the text. The proposed method tries to
address current challenges of summarization
approaches: (1) extracted sentences that are longer
than the average source sentence, (2)
dispersion of data in the text, (3) similarity of
information between extracted sentences, (4) lack of
coherence in the generated summary, and (5) dependence of the
summary on statistical parameters of the text elements,
such as term frequency. To address the first
problem we use statistical parameters, and for the other
problems we use centering theory.
The remainder of the paper is organized as follows.
Section 2 discusses related work on single-document
summarization in English and Persian and reviews the
literature on centering theory. Section 3
describes the proposed method in detail. The experimental
results are presented in Section 4, and finally conclusions are
drawn and future work is discussed.
II. RELATED WORKS
A. Extractive single document summarization
Many approaches have been proposed for single-document
summarization, each belonging to one of several
computational categories such as machine learning,
genetic algorithms, neural networks, fuzzy methods, clustering, and
statistics. For English, the work in [3] uses the LSI algorithm,
as a clustering approach, to provide
logarithmic evidence for term weighting. In [2], a
neural network applied to the DUC2001 dataset recognizes the first
sentence of each news text as the most important
sentence. In [4], a summarization method based on centering
theory is presented. In this
method, the backward-looking center (CB) of
each sentence is computed, and similar CBs across the
whole text are counted. Sentences that include
a CB that occurs among many of the CBs are then selected as
important sentences. The work in [5] constructs an utterance topic
model to generate a coherent summary using
centering theory and Latent Dirichlet
Allocation (LDA). The idea that centering theory can recognize
coherence in a text is the major contribution of that
paper. It focuses on DUC2005 (Document
Understanding Conference), TAC2008 (Text Analysis
Conference), and TAC2009, and reports good results for
summarization.
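The CB-counting procedure described for [4] can be illustrated with a minimal Python sketch. This is not the implementation of [4]; it is a simplified reading under several assumptions: each sentence is already given as a list of entity mentions ranked by salience, with co-references resolved; the CB of a sentence is taken to be the highest-ranked entity of the previous sentence that reappears in the current one; "similar" CBs are treated as identical strings; and a sentence is selected if it mentions one of the most frequent CBs. The function names and the top_k parameter are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def backward_looking_centers(
    cf_lists: Sequence[Sequence[str]],
) -> List[Optional[str]]:
    """Return one CB per sentence (None where no CB exists).

    cf_lists[i] holds the entities of sentence i, ranked by salience
    (assumed to be produced by an upstream pipeline).
    """
    if not cf_lists:
        return []
    cbs: List[Optional[str]] = [None]  # the first sentence has no predecessor
    for prev, curr in zip(cf_lists, cf_lists[1:]):
        curr_entities = set(curr)
        # Highest-ranked entity of the previous sentence realized in this one.
        cb = next((e for e in prev if e in curr_entities), None)
        cbs.append(cb)
    return cbs


def select_by_frequent_cb(
    cf_lists: Sequence[Sequence[str]], top_k: int = 1
) -> List[int]:
    """Return indices of sentences that mention one of the top_k most frequent CBs."""
    cbs = backward_looking_centers(cf_lists)
    counts = Counter(cb for cb in cbs if cb is not None)
    frequent = {cb for cb, _ in counts.most_common(top_k)}
    return [i for i, cf in enumerate(cf_lists) if frequent.intersection(cf)]


# Toy input (hypothetical entities, one ranked list per sentence).
example = [
    ["John", "store"],
    ["John", "piano"],
    ["John", "owner"],
    ["owner", "price"],
    ["weather"],
]
print(backward_looking_centers(example))  # [None, 'John', 'John', 'owner', None]
print(select_by_frequent_cb(example))     # [0, 1, 2]
```

In this toy input "John" is the CB of two sentences and "owner" of one, so with top_k=1 the three sentences mentioning "John" are selected. How entities are ranked and how CB similarity is judged are left open here, since the description above does not specify them.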
Unlike the summarization of English text, the
summarization of single and multiple documents written in
Persian is a relatively new field of research.
The first work on Persian is FarsiSum, from
2004 [6]. It is a Web-based application written in Perl
and based on SweSum [7]. FarsiSum selects sentences
from documents using the main body of
language-independent modules implemented in SweSum. It
adds a Persian stop-list in Unicode format and
adapts the interface modules to accept Persian texts. The
next work was done by Karimi and Shamsfard [8]. It is a
Persian single-document summarization method based on
lexical chains and graph-based methods. Zamanifar [9]
proposed an integrated method for Persian text
summarization that combines the term co-occurrence
property with conceptually related features of the Persian
language.
B. Centering Theory
Centering theory [10] is one of the components of the
general theory of discourse coherence of
Grosz and Sidner, and it deals with local coherence and
salience. The theory was formulated in [11] and is
supported by empirical evidence in [12]. Since this
theory has good potential for recognizing coherence and
2011 International Conference on Asian Language Processing
978-0-7695-4554-7/11 $26.00 © 2011 IEEE
DOI 10.1109/IALP.2011.52