International Journal of Computer Applications (0975 – 8887), Volume 79, No. 1, October 2013

Text Summarization using Centrality Concept

Ghaleb Al_Gaphari, Ph.D, Computer Faculty, Sana’a University, P.O. Box 1247, Sana’a, Yemen
Fadl M. Ba-Alwi, Ph.D, Computer Faculty, Sana’a University, P.O. Box 1247, Sana’a, Yemen
Aimen Moharram, Ph.D, Computer Faculty, Sana’a University, P.O. Box 1274, Sana’a, Yemen

ABSTRACT: The amount of textual information available on the web is measured in terabytes. A software program that summarizes web pages or electronic documents would therefore be a useful tool, speeding up reading, information access, and decision making. This paper investigates a graph-based centrality algorithm for the Arabic text summarization (ATS) problem. The algorithm extracts the most important sentences from a document or a set of documents (a cluster). It first computes the similarity between each pair of sentences and evaluates the centrality of each sentence in the cluster using a centrality graph; it then extracts the most central sentences for inclusion in the summary. The algorithm is implemented and evaluated both by human participants and by automatic metrics, with the Arabic NEWSWIRE-a corpus used as the evaluation data set. The results are very promising.

General Terms: AI Applications, NLP, Text Mining and AI Algorithms

Keywords: Text Summarization, Text Mining and Centrality Concept

1. INTRODUCTION
Information plays an important role in human daily life across modern societies. Unfortunately, when large amounts of knowledge are produced and made available through the web, distributing and accessing this valuable information efficiently and effectively becomes critical. In fact, people face a disorientation problem because of the abundance of such information.
Finding a specific piece of information in this mass of data requires search engines to perform a remarkable task in providing users with a subset of the original body of information. However, the subset retrieved by search engines is still substantial in size. For example, at the time of writing, the query “Summarization” in Google returned more than 9,090,000 results (as of 30 Jan 2011). Users still need to manually scan through each item retrieved by a web search engine until the information of interest is found. This tedious task makes automatic text summarization a task of great importance, as users can then simply read the summary and get an overview of the document. In other words, document retrieval alone is not sufficient: users need a second level of abstraction to reduce this huge amount of data, and text summarization provides it. Text summarization is one of the basic techniques in the area of text mining. Text mining is concerned with extracting relevant information from natural language text and searching for interesting relationships between the extracted entities [1, 23]. More specifically, text summarization is the process of extracting the most important information from a single document or multiple documents and producing a new, shorter version for a particular task and user, without losing the important content or overall meaning of the original document(s). This process can be seen as text compression; therefore, a text summarization system must identify the important parts based on the purpose of the summary or the user's needs. Text summarization techniques can be classified into two classes based on how summarization is performed on the input document(s): extractive and abstractive summarization.
The main objective of an extractive text summarization technique is to select the important sentences from the original input text and combine them into a new, shorter version. The selection of important sentences is based on linguistic features and on mathematical and statistical techniques. A summary built from the important sentences of the original text may not be coherent, but it conveys the main idea of the input text's content. The main idea behind an abstractive text summarization technique, in contrast, is to understand the original input text and then create a summary in its own words. The technique usually depends on linguistic models to generate new sentences from the original ones through a process called paraphrasing. It involves syntactic and semantic analysis of a specific language and is useful for applications where meaning must be preserved. In fact, abstractive text summarization is similar to the way a human creates a summary; unfortunately, this is still a challenging task for a computer program. As a matter of fact, there is an increasing demand for technologies for automatic Arabic text summarization [14, 15]. Fortunately, several research projects have investigated techniques for automatically summarizing English documents as well as documents in other European languages, and some software products have been developed for English text summarization, such as the MEAD summarization toolkit. Unfortunately, there is a shortage of both research papers and software for automatic Arabic text summarization. The main objective of this paper is to describe the results of implementing a graph-based centrality algorithm [25]. It is used to capture sentence centrality based on centrality measures such as degree and LexRank.
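To make the extractive, centrality-based approach concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it builds a sentence-similarity graph (cosine similarity over raw term-frequency vectors) and ranks sentences by degree centrality. The whitespace tokenization, the similarity threshold of 0.1, and the function names are all assumptions made for the example.

```python
# Illustrative sketch of graph-based extractive summarization via degree
# centrality. Tokenization, threshold, and names are assumptions, not the
# paper's exact method.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def degree_centrality_summary(sentences, k=2, threshold=0.1):
    """Return the k most central sentences, in their original order."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    degree = [0] * n
    # Add an (implicit) edge between every pair of sufficiently similar
    # sentences; a sentence's degree is its number of such neighbours.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    top = sorted(range(n), key=lambda i: degree[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that is similar to many other sentences in the cluster receives a high degree and is treated as central; a more refined variant would weight edges and iterate, as in LexRank.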
The paper also presents a graph representation of a document cluster, in which each node represents a sentence and each edge represents the similarity relation between a pair of sentences. The summarization algorithm is evaluated on two sources of documents: the AFP Arabic newswire corpus provided by the LDC, and the summarization evaluations of the Document Understanding Conference (DUC) [24].

2. RELATED WORKS
Over time, different methods and techniques have been applied to English text summarization and to other European languages. Those methods and techniques are associated with single-