Error Correction using DOP

Menno van Zaanen
School of Computer Studies
University of Leeds
Leeds, LS2 9JT, U.K.
menno@scs.leeds.ac.uk

Abstract

A robust parser needs well-defined behaviour when it is fed incorrect input. There are several ways to cope with incorrect input. The method described here starts by extending an Earley parser so that it can correct erroneous input. The corrections consist of inserting tokens into, or deleting tokens from, the input; other corrections can be simulated by combining these two operations.

When parsing an input string, typically more than one derivation can be generated. This is especially the case if the input has been corrected. The Data Oriented Parsing (DOP) model is used to disambiguate among the possible derivations. DOP selects the most probable derivation based on a corpus of derivations that were previously generated by the Earley parser.

The complete system, i.e. the extended Earley parser in combination with DOP, has been successfully tested on correcting corrupted C programs. The next step is to use the system for error correction in natural language processing.

1 Introduction

According to [GJ95], parsing is "the process of structuring a linear representation in accordance with a given grammar". A parser¹ takes a sentence² as input and, if the sentence is correct according to the grammar, generates a structure. This structure describes how the sentence can be generated from the grammar.

Unfortunately, a parser quite often encounters sentences that cannot be generated from the grammar. Since these sentences are not part of the language described by the grammar, the parser cannot generate a structure for them. Since we assume that the language contains all correct sentences, such a sentence is incorrect. Several types of error exist, but we will focus on syntactical errors.
Morphological, semantic and pragmatic errors will not be discussed, although morphological errors sometimes result in syntactical errors.

When we want a parser that can handle sentences that are not part of the language, a mechanism is needed that allows the parser to handle errors that may occur in the input. Such mechanisms are called error handling.

First, some general information on error handling is given, followed by a brief discussion of the parser used. Then the disambiguation method is described, which completes the basic concepts needed to understand the rest of the paper. After that, the implemented systems are described; where necessary, changes to the basic concepts are discussed there. Finally, a conclusion and some directions for future research are given.

¹ For more information on parsing, see [GJ95], [AU72], [ASU86] and [WM95].
² For more information on sentences, languages and grammars, see [Lin90].