An Automatic Linguistics Approach for Persian Document Summarization

Hossein Kamyar, Mohsen Kahani, Mohsen Kamyar, Asef Poormasoomi
Web Technology Lab, Ferdowsi University of Mashhad, Mashhad, Iran
Hossein.kamyar@stu-mail.um.ac.ir, kahani@um.ac.ir, mkamyar@stu-mail.um.ac.ir, As.poormasoomi@stu-mail.um.ac.ir

Abstract - In this paper we propose a novel technique for summarizing a text based on the linguistic properties of text elements and the semantic chains among them. Most summarization approaches rely mainly on statistical properties of text elements, such as term frequency. Here we use centering theory, which helps us recognize semantic chains in a text, to propose a new automatic single document summarization approach. Processing a text with centering theory and extracting a coherent summary requires a processing pipeline. This pipeline consists of several components, such as co-reference resolution, semantic role labeling, and part-of-speech (POS) tagging.

Keywords - Single-document summarization, Centering Theory, LSI, Extractive, Persian

I. INTRODUCTION

Automatic document summarization is an important tool in the age of explosive growth of data. According to [1], a summary is a text generated from one or more texts that contains the important concepts of those texts and is no longer than half of the source texts. This simple definition covers the main properties of a summary: (1) it is derived from one or more texts, (2) it conveys the major information of the source texts, and (3) it is short. Single document summarization concerns the extraction of important and salient knowledge from a text [2]. Research in this field can be categorized into extractive and abstractive summarization. An extractive summary returns selected sentences of the source as its important parts, whereas an abstractive summary represents the internal knowledge of a text, possibly using different wording [2].
In this work, we propose an extractive single document summarization approach that combines a linguistic theory (centering theory) with some statistical parameters of the text. The proposed method tries to address the current challenges of summarization approaches: (1) extracted sentences that are longer than the average source sentence, (2) dispersion of data in the text, (3) similarity of information between extracted sentences, (4) lack of coherence in the generated summary, and (5) dependence of the summary on statistical parameters of the text elements, such as term frequency. To address the first problem we use statistical parameters; for the other problems we use centering theory.

The remainder of the paper is organized as follows. Section 2 discusses related work on single document summarization in English and Persian and reviews the literature on centering theory. In Section 3, we describe the proposed method in detail. The experimental results are presented in Section 4. Finally, conclusions are drawn and future work is discussed.

II. RELATED WORKS

A. Extractive single document summarization

Many approaches have been proposed for single document summarization, each belonging to a computational category such as machine learning, genetic algorithms, neural networks, fuzzy logic, clustering, or statistics. For English, in [3] the LSI algorithm, as a clustering approach, is used as logarithmic evidence for term weighting. In [2], a neural network applied to the DUC2001 dataset recognizes the first sentence of each news text as the most important sentence. A summarization method based on centering theory is presented in [4]. In this method, the backward-looking center (CB) of each sentence is computed, and similar CBs are then counted across the whole text. Sentences whose CB occurs many times in the text are selected as important sentences.
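The CB-counting selection described for [4] can be illustrated with a minimal sketch. It is not the implementation of [4]: it assumes the backward-looking center of each sentence has already been computed by an upstream pipeline (for example, after co-reference resolution), treats "similar" CBs as exact string matches, and uses a hypothetical threshold `min_count` to decide when a CB counts as frequent. The function name and the toy CB labels are likewise assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence


def select_by_cb_frequency(
    sentences: Sequence[str],
    cbs: Sequence[Optional[str]],
    min_count: int = 2,  # hypothetical threshold, not taken from [4]
) -> List[str]:
    """Return sentences whose CB occurs at least `min_count` times in the text.

    `cbs[i]` is the precomputed backward-looking center of `sentences[i]`,
    or None when the sentence has no CB (e.g. the first sentence).
    """
    if len(sentences) != len(cbs):
        raise ValueError("sentences and cbs must have the same length")

    # Count how often each CB occurs across the whole text.
    counts = Counter(cb for cb in cbs if cb is not None)

    # Keep sentences, in source order, whose CB is among the frequent ones.
    return [
        sentence
        for sentence, cb in zip(sentences, cbs)
        if cb is not None and counts[cb] >= min_count
    ]


# Toy input with hand-assigned CBs (illustrative only).
sentences = [
    "John bought a car.",
    "He drove it home.",
    "He parked it in the garage.",
    "The garage was small.",
]
cbs = [None, "John", "John", "garage"]

summary = select_by_cb_frequency(sentences, cbs)
print(summary)  # ['He drove it home.', 'He parked it in the garage.']
```

Here "John" is the only CB that occurs twice, so only the two sentences centered on it are kept; a real system would also need a rule for grouping co-referring or near-identical CBs before counting.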
The work in [5] constructs an utterance topic model to generate a coherent summary using centering theory and Latent Dirichlet Allocation (LDA). The idea that centering theory can recognize coherence in a text is the major contribution of that paper. It focuses on the DUC2005 (Document Understanding Conference), TAC2008 (Text Analysis Conference), and TAC2009 datasets and reports good summarization results.

Unlike summarization of English text, summarization of single and multiple documents written in Persian is a relatively new field of research. The first work on Persian is FarsiSum, from 2004 [6]. It is a Web-based application written in Perl and based on SweSum [7]. FarsiSum selects sentences from documents using the language-independent modules implemented in SweSum as its main body. It adds a Persian stop-list in Unicode format and adapts the interface modules to accept Persian texts. The next work was done by Karimi and Shamsfard [8]. It is a Persian single document summarization method based on lexical chains and graph-based methods. Zamanifar [9] proposed an integrated method for Persian text summarization that combines the term co-occurrence property with conceptually related features of the Persian language.

B. Centering Theory

Centering theory [10] is one of the components of Grosz and Sidner's general theory of centering and discourse coherence, and it deals with local coherence and salience. The theory was formulated in [11] and is supported by empirical evidence in [12].

2011 International Conference on Asian Language Processing 978-0-7695-4554-7/11 $26.00 © 2011 IEEE DOI 10.1109/IALP.2011.52

Since this theory has good potential for recognizing coherence and