International Journal of Computer Applications (0975 – 8887) Volume 28– No.1, August 2011

Chunk-based Grammar Checker for Detection Translated English Sentences

Nay Yee Lin
University of Computer Studies, Yangon, Myanmar

Khin Mar Soe
Natural Language Processing Laboratory, University of Computer Studies, Yangon, Myanmar

Ni Lar Thein
University of Computer Studies, Yangon, Myanmar

ABSTRACT
Machine translation systems are expected to produce target-language output that is grammatically correct within the frame of the proper grammatical category. In the Myanmar-English statistical machine translation system, the target-language output (English) is often ungrammatical. To address this problem, we propose a chunk-based grammar checker for translated English sentences; this work is ongoing. Most typical grammar checkers can detect ungrammatical sentences and identify the type of error. However, they often fail to detect grammar errors that are characteristic of translated English sentences, such as missing words. Therefore, we intend to develop a grammar checker that uses a trigram language model and a rule-based model. The system identifies the chunk types and generates context-free grammar (CFG) rules for recognizing the grammatical relations between chunks. In this paper, we present an overview of the current research, which is organized around three main functions: detecting sentence patterns in chunk types, analyzing chunk errors, and correcting the errors. We hope that this work will help improve the quality of Myanmar-to-English translation.

General Terms
Statistical Machine Translation, Context Free Grammar, Trigram Language Model, Rule-based Model

Keywords
Chunk-based Grammar Checker

1. INTRODUCTION
Grammar is the set of structural rules that govern the composition of clauses, phrases, and words in any given natural language. Grammar checking is one of the most widely used tools within natural language processing (NLP) applications.
Grammar checkers check the grammatical structure of sentences based on morphological processing and syntactic processing. These two steps are part of natural language processing, which aims to understand natural languages. Morphological processing is the step in which individual words are analyzed into their components and non-word tokens, such as punctuation, are separated from the words. Syntactic processing is the analysis in which linear sequences of words are transformed into structures that show the grammatical relationships between the words in the sentence [6].

Grammar checkers are most often implemented as a feature of a larger program, such as a word processor. However, such a feature is not available as a separate free program for machine translation. Therefore, we propose a grammar checker as a complement to machine translation.

Three main approaches are widely used for grammar checking in a language: syntax-based checking, statistics-based checking, and rule-based checking. In syntax-based grammar checking, each sentence is completely parsed to check its grammatical correctness. The text is considered incorrect if the syntactic parsing fails. In the statistics-based approach, part-of-speech (POS) tag sequences are built from an annotated corpus, and the frequency, and thus the probability, of each sequence is recorded. The text is considered incorrect if the POS-tagged text contains POS sequences with frequencies lower than some threshold. The statistics-based approach essentially learns the rules from the tagged training corpus. The rule-based approach is very similar to the statistics-based one, except that the rules must be handcrafted [5].

Among these approaches, this paper presents a grammar checker that combines a statistical model and a rule-based model. In this approach, the translated English sentence is used as the input. First, the input sentence is tokenized and each word is tagged with its POS.
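As a brief illustration of the statistics-based check described above, the following minimal sketch counts POS-tag trigrams in a tiny tagged corpus and flags an input whose trigrams fall below a frequency threshold. The training sequences, the Penn Treebank-style tag names, the boundary markers, the threshold value, and the function names are all assumptions made for illustration; they are not taken from the system presented in this paper.

```python
from collections import Counter

def pos_trigrams(tags):
    """Return the POS trigrams of a tag sequence, padded with boundary markers."""
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def train(tagged_corpus):
    """Count how often each POS trigram occurs in the training tag sequences."""
    counts = Counter()
    for tags in tagged_corpus:
        counts.update(pos_trigrams(tags))
    return counts

def is_suspicious(tags, counts, threshold=1):
    """Flag a sentence if any of its POS trigrams occurs fewer than `threshold` times."""
    return any(counts[t] < threshold for t in pos_trigrams(tags))

# Hypothetical training data: POS tag sequences of grammatical sentences.
TRAINING_TAGS = [
    ["PRP", "VBZ", "DT", "NN"],        # e.g. "He reads a book"
    ["PRP", "VBZ", "DT", "JJ", "NN"],  # e.g. "She buys a new car"
    ["DT", "NN", "VBZ", "DT", "NN"],   # e.g. "The boy eats an apple"
]
counts = train(TRAINING_TAGS)

print(is_suspicious(["PRP", "VBZ", "DT", "NN"], counts))  # False: every trigram was seen
print(is_suspicious(["PRP", "DT", "NN"], counts))         # True: the verb is missing
```

In practice the counts would come from a large annotated corpus and the threshold would be tuned on held-out data; the toy corpus here only shows the mechanism.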
Then these tagged words are grouped into chunks by parsing the sentence into a chunk-based sentence structure. After chunking, the relationships between the chunks of the input sentence are detected using trained sentence patterns. If the sentence pattern is incorrect, we analyze the chunk errors and then correct them.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 gives an overview of the Myanmar-English Statistical Machine Translation System. Section 4 explains the proposed chunk-based grammar checker. Section 5 reports the experimental results of the proposed system, and Section 6 concludes the paper.

2. RELATED WORK
Many researchers have worked on grammar checking in NLP for various languages. In the following paragraphs, we briefly discuss some of the related work.

An n-gram statistical grammar checker for both Bangla and English is proposed in [8]. It uses n-gram-based analysis of words and POS tags to decide whether a sentence is grammatically correct. In [1], a model is applied to reduce translation errors using a pre-editor for Indian English sentences. The authors used a major corpus in the tourism and health domains. Structures of English used mostly in India were identified in order to design the predictor. This was incorporated into the AnglaBharti engine and gave a significant improvement in the machine translation output. An alternative approach checked the Swedish grammar for evaluation tool and post processing tool of Statistical Machine