Proceedings of InSTIL/ICALL2004 – NLP and Speech Technologies in Advanced Language Learning Systems – Venice 17-19 June, 2004

Evaluating Students' Summaries with GETARUNS

Rodolfo Delmonte
Department of Language Sciences
Università Ca' Foscari
Ca' Garzoni-Moro - San Marco 3417 - 30124 VENEZIA
e-mail: delmont@unive.it
website: http://project.cgm.unive.it

Abstract

Evaluating summaries is currently performed with statistically based tools which lack any linguistic knowledge and are unable to produce grammatical and semantic judgements (Landauer et al., 1997). However, summary evaluation needs precise linguistic information with a much finer-grained coverage than what is offered by currently available statistically based systems. We assume that the starting point of any interesting application in this field must necessarily be a good syntactic-semantic parser. In this paper we present GETARUNS, the General Text and Reference Understanding System (Delmonte, 2003). The heart of the system is a rule-based top-down parser which uses an LFG-oriented grammar organization. Lately, a less constrained version of the parser has been developed for the application field of text summarization, which allows the system to recover gracefully from failures. To this end, the parser is coupled with a concurrent parsing process: a partial or shallow parse is always produced and used to recover from complete failures. GETARUNS has a highly sophisticated, linguistically based semantic module which is used to build up the Discourse Model. Semantic processing is strongly modularized and distributed among a number of different submodules which take care of Spatio-Temporal Reasoning and Discourse-Level Anaphora Resolution. Evaluation taps information from the Discourse Model and uses Predicate-Argument Structures (PAS) to detect students' understanding of the text to be summarized.
It also uses the output of the Anaphora Resolution Module to check for the most relevant topics in the text, which the student should have addressed in his/her summary. The system maintains a Topic Stack while processing the text in order to resolve coreference among referential expressions: the Topic-Stack Hierarchy gauges each nominal head as either Main, Secondary or Potential Topic. This grading is used as a score that allows the system to detect the most relevant entities in the text at the end of the computation.

1. Introduction

Currently available summary and essay evaluation systems are based on statistical and mathematical procedures which are used to assess students' linguistic abilities. We are here referring to such tools as the LSA-based Summary Street®. Latent Semantic Analysis (Landauer et al., 1997) is a statistical theory of meaning which tells the student "…how well your summary covers the information in the original text. It will tell you if your summary is too long for a good summary." It is unable to check for grammaticality; neither coherence nor cohesion is checked; and, what is worse, no semantic soundness can be checked. LSA techniques simply look for semantic similarities by comparing the most frequent content words, with knowledge of the surrounding most frequent content words, at sentence and paragraph level. LSA does not take into account content words with a frequency of occurrence of two or lower; nor does it take into account the order in which content words co-occur. It captures text similarity in terms of differences in word choice among different texts. This seems to me too poor a way of characterizing meaning: on the contrary, the authors speak of LSA as a tool that "captures a great deal of the similarity of meanings expressed in discourse", and use that as "…the basis for performing automated scoring of essays".
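The LSA mechanics just described can be made concrete with a minimal sketch: a term-by-document count matrix is factorized with a truncated SVD, and documents are compared by cosine similarity in the resulting latent space. This is only an illustrative approximation of the technique, assuming numpy; the toy documents and the rank k are invented for the example and are not taken from Summary Street® or the cited paper. Note how two documents containing the same words in a different order are indistinguishable:

```python
# Minimal sketch of LSA-style document similarity (illustrative only).
import numpy as np

def term_doc_matrix(docs):
    """Build a term-by-document count matrix over the shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1
    return A, vocab

def lsa_doc_vectors(A, k=2):
    """Project document columns into a k-dimensional latent space via SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

docs = [
    "the manager hit the bottle that day",
    "that day the manager hit the bottle",   # same words, different order
    "the parser builds a discourse model",
]
A, vocab = term_doc_matrix(docs)
V = lsa_doc_vectors(A, k=2)

# Word counts for docs 0 and 1 are identical, so their latent vectors
# coincide: an order-blind model cannot tell them apart.
print(cosine(V[0], V[1]))  # 1.0 (up to rounding)
print(cosine(V[0], V[2]))  # substantially lower
```

Since the first two documents yield identical matrix columns, no amount of dimensionality reduction can separate them, which is exactly the word-order blindness discussed below.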
Given that LSA does not take into account word order and discards such important elements as negation items, it follows that there is no way to tell whether simple co-occurrence indicates similarity of meaning. The experiment reported in the same article is rather crude. At first glance, Landauer et al. seem concerned only with dismissing what has gone on in syntactic theory and its contribution to the determination of meaning. However, in the following paragraph they come up with the opposite statement:

The fact that LSA can capture as much of meaning as it does without using word order shows that the mere combination of words in passages constrains overall meaning very strongly. How can this be? In addition to the contrary theoretical presumptions mentioned earlier, various intuitive and rational arguments suggest that such representations must fall far short of extracting as much meaning from text as do human readers. For instance, the following two sentences are identical for LSA, but have very different meanings for a human reader: "It was not the sales manager who hit the bottle that day, but the office worker with the serious drinking problem."; "That day the office manager, who was drinking, hit the problem sales worker with a bottle, but it was not serious…"

Nonetheless, what such examples prove is only that a method that ignores word order