International Journal of Computer Applications (0975 – 8887)
Volume 47 – No. 23, June 2012

A Feature Terms based Method for Improving Text Summarization with Supervised POS Tagging

Suneetha Manne
Associate Professor, Department of Information Technology
VR Siddhartha Engineering College, Vijayawada, Andhra Pradesh

S. Sameen Fatima
PhD, Professor and Head, Dept. of Computer Science & Engg.
University College of Engineering, Osmania University, Hyderabad

ABSTRACT
Text summarization is the process of distilling the most important information from a source to produce an abridged version for a particular user and task. When this is done by a computer, i.e. automatically, it is called Automatic Text Summarization. Summarization can be classified into two approaches: extraction and abstraction. Extraction-based summaries are produced by concatenating several sentences taken exactly as they appear in the texts being summarized. Abstraction-based summaries are written to convey the main information in the input and may reuse phrases or clauses from it. This paper focuses on the extraction approach. The goal of extraction-based text summarization is sentence selection. One way to select sentences is to score each sentence using a set of feature terms (sentence ranking) and then choose the highest-ranked ones. The first step in summarization by extraction is the identification of important features. In our approach, 1000 computer science research papers are used as test documents. Each document is preprocessed by sentence segmentation, tokenization, stop word removal, case folding, lemmatization, and stemming. A score is then calculated for each sentence using important features, sentence filtering features, and data compression features. The proposed text summarization uses an HMM tagger to improve the quality of the summary.
We compare our results with existing summarizers, including the Copernicus summarizer, the Great summarizer, and the Microsoft Word 2007 summarizer. The proposed system is also tested with four similarity measures: Cosine, Jaccard, Jaro-Winkler, and Sorensen. The results show that the feature terms method produced the best-quality summaries.

General Terms
Text Mining, Information Extraction, Automatic Text Summarization, Natural Language Processing, POS Tagging.

Keywords
Term frequency, Term weight, Inverse sentence frequency, Noun and verb chunking, Sentence Position, Sentence Length, Verb featured sentences, Compression Ratio, Retention Ratio, Data Compression features, HMM Tagger.

1. INTRODUCTION
Today, the amount of information available is tremendous, and the number of pages available on the Internet almost doubles every year [1]. In January 2012, the number of hosts advertised in the DNS was 888,239,420 [2]. Finding relevant information and making use of it is becoming more difficult. Most of the information that comes from the Internet is in the form of text. Because contextual and discourse awareness is lacking, most users cannot readily judge the relevance and appropriateness of the information in web documents. Summarization is a powerful technique for reducing the effort of finding relevant information with contextual and discourse awareness, and it helps the user find appropriate information. In this context, automatic text summarization is an area of research that has attracted many researchers [20]. According to Sparck Jones, "Text summarization is a reductive transformation of a source text into a summary text by extraction or generation" [3]. An automatic summary is simply a computer-generated summary. Although automatic text summarization is a prominent research area, very few software tools are available to end users.
The reason is that the quality of automatically produced summaries is low. In general, creating a good summary requires a lot of intelligence. Clear evidence of this is the use of summaries for understanding documents and for locating relevant information in large volumes of text.

Various approaches to text summarization are available in the literature, including information extraction (IE), NLP, and others. IE is a special type of text summarization that is designed specifically for the input documents [4]. Thus, this investigation proposes summary generation for academic research papers.

Summarization is a hard problem in Natural Language Processing because, to do it properly, one has to really understand the point of a text. This requires semantic analysis, discourse processing, and inferential interpretation. The last step is especially complex, because systems without a great deal of world knowledge simply cannot do it. Therefore, attempts so far at true abstraction (creating abstracts as summaries) have not been very successful. Machine learning techniques from closely related fields such as information retrieval and text mining have also been adapted to help automatic summarization [5].

Researchers and students constantly face the problem that it is almost impossible to read most newly published papers in order to stay informed of the latest progress, and when they work on a research project, the time spent reading the literature seems endless. The goal of this research is to design a domain-independent, extraction-based automatic text summarization system to alleviate, if not totally solve, this problem.

The organization of the paper is as follows. Section 2 discusses the existing techniques of text document summarization. Section 3 presents the proposed approach to automatic text summarization. Experimental results and performance evaluation are presented in Section 4. Section 5 concludes the paper.
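
The abstract names Cosine, Jaccard, and Sorensen similarity among the measures used to test the system. As a minimal sketch, assuming two summaries are compared as bags or sets of lowercase word tokens (this section does not state the representation actually used), the three measures could be computed as shown below. The function names and the whitespace tokenizer are illustrative assumptions, not the authors' implementation. Jaro-Winkler is omitted because it operates on character sequences rather than token sets.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: case folding plus whitespace splitting.
    return text.lower().split()


def cosine_similarity(a, b):
    # Cosine of the angle between the two term-frequency vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    # Shared distinct tokens divided by all distinct tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def sorensen_similarity(a, b):
    # Twice the shared distinct tokens divided by the sum of set sizes.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


if __name__ == "__main__":
    system = "text summarization by extraction"
    reference = "text summarization by abstraction"
    print(cosine_similarity(system, reference))
    print(jaccard_similarity(system, reference))
    print(sorensen_similarity(system, reference))
```

All three functions return a value between 0 and 1, where 1 means the two token collections are identical under that measure. Jaccard and Sorensen ignore how often a token repeats, while the cosine variant here takes term frequency into account.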