An automatic method for revising ill-formed sentences based on N-grams

Theologos Athanaselis, Stelios Bakamidis, Ioannis Dologlou
Institute for Language and Speech Processing (ILSP)
Artemidos 6 & Epidavrou, GR-151 25 Maroussi, Greece
{tathana;bakam;ydol}@ilsp.gr

Abstract

A good indicator of whether a person really knows a language is the ability to use the appropriate words in a sentence in the correct order. "Scrambled" words produce meaningless, ill-formed sentences. Since the language model is extracted from a large text corpus, it encodes the local dependencies of words. Word order errors usually violate the syntactic rules locally, and therefore N-grams can be used to fix ill-formed sentences. This paper presents an approach for repairing word order errors in text by reordering the words in a sentence and choosing the version that maximizes the number of trigram hits according to a language model. The novelty of this method lies in the use of an efficient confusion matrix technique for reordering the words. Its comparative advantage is that it works with a large set of words and avoids the laborious and costly process of collecting word order errors to create error patterns.

1. Introduction

Writers sometimes make errors that violate a language's grammar, e.g. sentences with wrong word order. What appears to hold in all languages is that words cannot be randomly ordered in sentences; they must be arranged in certain ways, both globally and locally. For example, in English the normal ordering of elements is subject, verb, object (Boy meets girl) [1]. Subjects and objects are composed of noun phrases, and within each noun phrase are elements such as articles, adjectives, and relative clauses associated with the nouns that head the phrase (the tall woman who is wearing a hat).
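The core idea of the abstract — generate reorderings of a sentence and keep the one with the most trigram hits — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the trigram set and the example sentence are hypothetical, and it uses exhaustive permutation rather than the paper's confusion matrix technique for reordering.

```python
from itertools import permutations

# Toy trigram "language model": a set of trigrams assumed to have been
# extracted from a large corpus (hypothetical data for illustration).
VALID_TRIGRAMS = {
    ("the", "boy", "meets"),
    ("boy", "meets", "girl"),
    ("meets", "girl", "today"),
}

def trigram_hits(words, lm=VALID_TRIGRAMS):
    """Count how many consecutive word triples of the sentence appear in the LM."""
    return sum((words[i], words[i + 1], words[i + 2]) in lm
               for i in range(len(words) - 2))

def best_reordering(words, lm=VALID_TRIGRAMS):
    """Exhaustively reorder the words and keep the order with most trigram hits.
    (The paper instead prunes the permutation space with a confusion matrix.)"""
    return max(permutations(words), key=lambda p: trigram_hits(p, lm))

print(best_reordering(["meets", "boy", "girl"]))  # ('boy', 'meets', 'girl')
```

Exhaustive search over permutations is factorial in sentence length, which is exactly why the paper introduces a technique for reducing the number of permutations (section 4).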
Native speakers of a language seem to have a sense of the order of constituents of a phrase, and such knowledge appears to lie outside of what one learns in school [2]. Automatic grammar checking is traditionally done with manually written rules, constructed by computational linguists. Methods for detecting grammatical errors without manually constructed rules have been presented before. Atwell [3] uses the probabilities in a statistical part-of-speech tagger, detecting errors as low-probability part-of-speech sequences. Golding [4] showed how methods used for decision lists and Bayesian classifiers could be adapted to detect errors resulting from common spelling confusions among sets such as "there", "their" and "they're". He extracted contexts from correct usage of each confusable word in a training corpus and then identified a new occurrence as an error when it matched the wrong context. Chodorow and Leacock [5] suggested an unsupervised method for detecting grammatical errors by inferring negative evidence from edited textual corpora. Heift [6,7] released the German Tutor, an intelligent language tutoring system where word order errors are diagnosed by string comparison of base lexical forms. Bigert and Knutsson [8] presented how a new text is compared to known correct text and deviations from the norm are flagged as suspected errors. Sjobergh [9] introduced a method of grammar error recognition by adding errors to a large amount of (mostly error-free) unannotated text and applying a machine learning algorithm.

Unlike most of these approaches, the proposed method does not work only with a limited set of words. The use of a parser and/or tagger is not necessary. It also needs no manually written rules, since the constraints are captured by the statistical language model. A comparative advantage of this method is that it avoids the laborious and costly process of collecting word order errors to create error patterns.
Finally, the performance of the method does not depend on word order patterns, which vary from language to language.

The paper is organized as follows: Section 2 presents the language model. The architecture of the entire system follows in section 3. Section 4 describes the technique for reducing the permutations. Section 5 specifies the method used for searching valid trigrams in a sentence. The results of the TOEFL experimental scheme are discussed in section 6. Finally, concluding remarks are made in section 7.

2. Language model

The language model (LM) used subsequently is the standard statistical N-gram model. The N-grams provide an estimate of \( P(W) \), the probability of an observed word sequence \( W \). Assuming that the probability of a given word in an utterance depends on a finite number of preceding words, the probability of an N-word string can be written as:

\( P(W) = \prod_{i=1}^{N} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-(n-1)}) \)   (1)

N-grams simultaneously encode syntax, semantics and pragmatics, and they concentrate on local dependencies [10]. This makes them very effective for languages where word order is important and the strongest contextual effects tend to come from near neighbours. A statistical language model describes probabilistically the constraints on word order found in language: typical word sequences are assigned high probabilities, while atypical ones are assigned low probabilities. N-grams have also been chosen because N-gram probability distributions can be computed directly from text data, hence requiring no explicit linguistic rules (e.g. formal grammars). The statistical language model consists of bigrams (N=2) and trigrams (N=3).

Speech Prosody 2006, Dresden, Germany, May 2-5, 2006. ISCA Archive, http://www.isca-speech.org/archive
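Since, as noted above, N-gram probability distributions can be computed directly from text data, equation (1) with N=3 reduces to counting. The sketch below estimates trigram probabilities by maximum likelihood from raw counts; the tiny corpus is hypothetical, and no smoothing is applied (a real LM would smooth and handle sentence boundaries).

```python
from collections import Counter

def train_trigram_lm(corpus_sentences):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by maximum likelihood:
    count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1}),
    where the denominator counts the bigram only where a trigram starts."""
    tri, bi = Counter(), Counter()
    for sent in corpus_sentences:
        for i in range(len(sent) - 2):
            tri[tuple(sent[i:i + 3])] += 1
            bi[tuple(sent[i:i + 2])] += 1
    return {t: c / bi[t[:2]] for t, c in tri.items()}

# Hypothetical two-sentence corpus for illustration.
corpus = [["the", "boy", "meets", "the", "girl"],
          ["the", "boy", "sees", "the", "girl"]]
lm = train_trigram_lm(corpus)
print(lm[("the", "boy", "meets")])  # 0.5: "the boy" continues with "meets" in 1 of 2 cases
```

With such a table, the probability of a whole sentence under equation (1) is simply the product of the probabilities of its consecutive trigrams.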