Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge

Johnny Bigert and Ola Knutsson
Numerical Analysis and Computer Science
Royal Institute of Technology, Sweden
{johnny, knutsson}@nada.kth.se

Abstract

This article presents a robust probabilistic method for the detection of context-sensitive spelling errors. The algorithm identifies less frequent grammatical constructions and attempts to transform them into more frequent constructions while retaining similar syntactic structure. If the transformations result in low-frequency constructions, the text is likely to contain an error. A first unsupervised approach uses only information derived from a part-of-speech tagged corpus. This experiment shows a good error detection capacity but also a high rate of false alarms, in many cases due to phrase and clause boundaries. In a second approach, we combine the first method with robust phrase and clause recognition to avoid many of the false alarms in the first experiment. A comparative evaluation of the experiments shows that the introduction of linguistic knowledge dramatically increases the precision of the error detection method.

1 Introduction

Even though spell checkers are widely used and commercialized, many spelling mistakes in writers' texts are still left for humans to identify. In this paper we focus on a category of error types that we consider very difficult to detect. One of these error types is the category of so-called context-sensitive spelling errors (e.g. Mays et al. (1991)). The other error type that we are focusing on is erroneously split compounds, which are frequent in compounding languages.

Several approaches have been proposed to detect and correct context-sensitive spelling errors. Most approaches operate on sets of easily confused words and are based on context features for each word in the confusion set, such as word and part-of-speech context (Yarowsky (1994), Golding (1995), Golding and Roth (1996)).
Furthermore, some include part-of-speech (POS) tag trigram information to determine which candidate is the most likely (Mays et al. (1991), Golding and Schabes (1996)). A rule-based approach using machine learning is given in (Mangu and Brill, 1997). Although useful for detection and correction, these approaches require a list of confusion sets predicted beforehand.

The main drawback with the algorithms mentioned for error detection is that they require knowledge about the errors to be found. Often, such errors are not known in advance and the errors predicted may not be sufficient. We want to be able to detect errors from categories of difficult spelling errors such as spelling errors resulting in an existing word. Furthermore, we would like something more general and robust.

In this paper we propose a probabilistic method for detection of context-sensitive spelling errors and erroneously split compounds. The main contribution of this paper is an approach to mitigate the problem of sparse data. To this end, we use a novel combination of existing techniques, such as POS tagging, shallow parsing and phrase transformations.

The basic idea is to identify rare sequences of morpho-syntactic tags and by different methods determine if the sequences are rare due to the sparse data problem or phrase and clause boundaries. If a rare sequence cannot be transformed into a more frequent one using the methods, the sequence is considered to contain an error. We have investigated the methods in two experiments.

The first experiment is conducted with an unsupervised method using only information derived from a rather small part-of-speech tagged