Effective summarization method of text documents

Rasim Alguliev, Ramiz Aliguliyev
Institute of Information Technology, Azerbaijan National Academy of Sciences, Baku, Azerbaijan
director@iit.ab.az, aramiz@iit.ab.az

Abstract

This work is dedicated to the problem of classifying text documents through summarization. There are various approaches to text document classification. Most classification methods that rely on the Vector Space Model analyze separate words in documents. To increase the accuracy of document classification, it is necessary to take into account more informative features of the documents in question. For this purpose, a summarization method referred to as "preprocessing in document classification" is proposed in this work. During summarization, the method takes into account the weight of each sentence in the document. The essence of the proposed method is to first represent every sentence in the document as a characteristic vector of the words that appear in the document, and then to calculate a relevance score for each sentence. The relevance score of a sentence is determined by comparing it, using the cosine measure, with all the other sentences in the document and with the document title. Before the method is applied, the set of features is defined, and the weight of each word in a sentence is then calculated with these features taken into account. The weights of the features that influence the relevance of words are determined using genetic algorithms.

1. Introduction

Of all the kinds of information accumulated on the WWW, text data is, as a rule, of the greatest interest. Every day hundreds of new text documents appear on the Internet, adding to the already enormous amount of accessible text information. Moreover, text search is not limited to finding relevant web pages on the Internet; it is also used in numerous important applications, such as the classification of text objects and the creation of automated reference systems.
In such cases, the problem of mining text documents becomes acute. There are various methods of data mining [1], and classification is one of them. Classification consists of partitioning a sample of text documents into non-overlapping groups so as to ensure maximal “proximity” (similarity) between the documents of each group, which corresponds to a certain topic, and maximal difference between groups [10].

There are various approaches to the classification of text documents [1, 22]. Most classification methods rely on the Vector Space Model and analyze separate words in documents [17, 18, 19]. The Vector Space Model represents each document as a characteristic vector of the words that appear in the whole collection of documents in question. Each characteristic vector contains the weights (usually the number of occurrences) of the words appearing in the collection. Similarity between documents is measured with one of several similarity measures, such as the cosine measure, the Euclidean measure and the Jaccard measure.

To attain a higher level of accuracy in document classification, it is necessary to take into account more informative features of documents. For instance, in [11] the weights of HTML tags, which affect the efficiency of information retrieval, are determined using genetic algorithms. In [4], document classification is carried out at the level of separate words, but unlike in classical works, the relevance of each word is defined in relation to its informative features: the occurrence of the word in the title, emphasis of the word by means of italic or bold fonts or underlining, and the position of the word on the page. A DIG (Document Index Graph) algorithm, based on graph theory and taking into account phrases and their weights, was proposed in [8]. Here the term “phrase” means a sequence of words, not the grammatical structure of a sentence.
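As a minimal sketch of the Vector Space Model representation and the three similarity measures named above, the following Python fragment builds occurrence-count vectors over a toy collection and compares them. The tokenizer, the function names and the sample documents are illustrative assumptions, not part of the cited works; the Jaccard measure is computed here on the sets of words present in each document, which is one common variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Sorted list of all distinct words in the whole collection."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def to_vector(document, vocabulary):
    """Characteristic vector: occurrence count of each vocabulary word."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean distance; smaller values mean more similar documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_similarity(u, v):
    """Jaccard measure on the sets of words with nonzero weight."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return len(present_u & present_v) / len(union)


# Hypothetical toy collection, used only for illustration.
documents = [
    "text summarization helps text classification",
    "genetic algorithms tune feature weights",
    "summarization of text documents",
]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]
```

With these vectors, `cosine_similarity(vectors[0], vectors[2])` is larger than `cosine_similarity(vectors[0], vectors[1])`, because the first and third documents share words while the first and second share none.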
The GIS (Generalized Instance Set) algorithm proposed in [13] combines the k-nearest neighbors method and a linear classifier. In recent years, a text summarization technique [20] called “preprocessing in classification” has been widely used in the classification of text documents. The summarization technique is used to extract important contexts [2], sentences [6, 7, 12, 21], and paragraphs [16, 9]. The effect of context,