Using Latent Semantic Analysis in Text Summarization and Summary Evaluation

Josef Steinberger * jstein@kiv.zcu.cz
Karel Ježek * Jezek_ka@kiv.zcu.cz

Abstract: This paper deals with using latent semantic analysis (LSA) in text summarization. We describe a generic text summarization method which uses the latent semantic analysis technique to identify semantically important sentences. We have further improved this method. We then propose two new evaluation methods based on LSA, which measure content similarity between an original document and its summary. In the evaluation part we compare seven summarizers using a classical content-based evaluator and the two new LSA evaluators. We also study the influence of summary length on summary quality from the viewpoint of the three evaluation methods.

Key Words: Generic Text Summarization, Latent Semantic Analysis, Summary Evaluation

1 Introduction

Generic text summarization is a field that has seen increasing attention from the NLP community. The huge amount of electronic information now available has to be reduced so that users can handle it more effectively. We mention here the classes of summarization methods and a recently published method based on LSA, which we have further modified and improved. One of the most controversial parts of summarization research is the evaluation process. The next part of the article deals with the possibilities of summary evaluation. There we propose two new evaluation methods based on LSA, which measure the content similarity between an original document and its summary. At the end of the paper we present evaluation results and further research directions.

2 Generic Text Summarization Methods

Generic text summarization approaches are divided into four classes. The first class we call heuristic approaches.
These extraction methods score sentences using simple techniques, for example the position of a sentence within the document or the occurrence of a word from the title in a sentence [6]. The next group includes approaches based on a document corpus (corpus-based methods) [7]. An example of such a method is TF.IDF (term frequency · inverse document frequency). The third class consists of methods which take discourse structure into account. An example is the lexical chains method, which searches for chains of context words in the text [8]. The last group is called knowledge-rich approaches. They are the most advanced but can be used only in particular domains (e.g. STREAK, which summarizes basketball games [9]). A fairly new approach in text summarization uses latent semantic analysis.

* Department of Computer Science and Engineering, Univerzitní 22, CZ-306 14 Plzeň
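
As a rough illustration of the heuristic class described in Section 2, the following Python sketch scores each sentence by its position in the document and by the number of title words it contains, then extracts the best-scoring sentences. It is not the method of [6]: the function names, the weights, and the 1/(index + 1) position decay are all assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def heuristic_scores(title, sentences, position_weight=1.0, title_weight=1.0):
    """Score sentences by position and by overlap with the title words.

    Both weights and the 1/(index + 1) position decay are illustrative
    assumptions, not values taken from the paper or from [6].
    """
    title_words = set(tokenize(title))
    scores = []
    for index, sentence in enumerate(sentences):
        # Earlier sentences receive a higher position score.
        position_score = 1.0 / (index + 1)
        # Count distinct title words that also occur in the sentence.
        overlap = len(title_words & set(tokenize(sentence)))
        scores.append(position_weight * position_score + title_weight * overlap)
    return scores


def extract_summary(title, sentences, k=2):
    """Return the k best-scoring sentences in their original order."""
    scores = heuristic_scores(title, sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `extract_summary(title, sentences, k)` with a document title and a list of sentences returns an extract of `k` sentences. A real heuristic summarizer would typically combine more features and tune the weights on data, which this sketch does not attempt.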