Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 165–171, December 4–7, 2020. © 2020 Association for Computational Linguistics

Generating Inflectional Errors for Grammatical Error Correction in Hindi

Ankur Sonawane, Sujeet Kumar Vishwakarma, Bhavana Srivastava, Anil Kumar Singh
Indian Institute of Technology (BHU), Varanasi
{sankur.shrikant.eee16, sujeetkr.vishwakarma.eee16}@itbhu.ac.in
{bhavanasrivastava.rs.cse17, aksingh.cse}@iitbhu.ac.in

Abstract

Automated grammatical error correction has been explored as an important research problem within NLP, with the majority of the work being done on English and similar resource-rich languages. Grammar correction using neural networks is a data-heavy task, with the recent state-of-the-art models requiring datasets with millions of annotated sentences for proper training. It is difficult to find such resources for Indic languages due to their relative lack of digitized content and complex morphology, compared to English. We address this problem by generating a large corpus of artificial inflectional errors for training GEC models. Moreover, to evaluate the performance of models trained on this dataset, we create a corpus of real Hindi errors extracted from Wikipedia edits. Analyzing this dataset with a modified version of the ERRANT error annotation toolkit, we find that inflectional errors are very common in this language. Finally, we produce the initial baseline results using state-of-the-art methods developed for English.

1 Introduction

Grammatical Error Correction (GEC) involves automatically correcting errors in written text, whether relating to orthography, syntax or fluency. Today, most approaches to this problem favour statistical and deep learning methods over rule-based methods.
These methods treat GEC as a translation task, from an ungrammatical to a grammatically correct form of the same language (Brockett et al., 2006). This requires a considerable amount of supervised data in the form of ‘edits’, which are pairs of incorrect and correct sentences. Researchers have recently done remarkable work on English and a few other resource-rich languages and have released many datasets to evaluate state-of-the-art methods. Comparatively less attention has been given to low-resource languages, and Indic languages have been neglected in particular. Systems like UTTAM (Jain et al., 2018) and SCMIL (Etoori et al., 2018) have applied probabilistic approaches and deep learning, respectively, to the problem of spelling correction in Indic languages. Moreover, simple n-gram based models (Singh and Singh, 2019; Kanwar et al., 2017) have been used for “real-word” error correction, which is a very similar problem to GEC. However, to our knowledge, no such work exists for true GEC in this language. Thus, we sum up our contributions as follows:

1. We create a parallel corpus of synthetic errors by inserting errors into grammatically correct sentences using a rule-based process, focusing specifically on inflectional errors. Since this process is generic, it can easily be extended to other Indic languages.

2. We scrape Hindi edits from Wikipedia and filter them to provide another, smaller corpus of errors. Since this corpus is extracted from a relatively natural source, it can be useful for evaluating GEC systems. We also analyze this corpus using an extended version of the ERRANT toolkit.

3. We evaluate a few well-studied approaches for languages like English on these datasets, and thus produce the initial GEC results for the Hindi language.

The code and data to reproduce our experiments are available at http://github.com/s-ankur/hindi_grammar_correction.
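To make the rule-based generation process concrete, the sketch below shows one common way such synthetic inflectional errors can be injected: tokens belonging to a confusion set of inflectional variants are randomly swapped for another variant, and each corrupted sentence is paired with its original to form a synthetic edit. This is a minimal illustration only; the confusion sets shown (the Hindi genitive postposition का/की/के and inflected forms of लड़का, "boy") and the function names are our own assumptions, not the paper's actual rules.

```python
# -*- coding: utf-8 -*-
# Minimal sketch of confusion-set-based inflectional error injection.
# The confusion sets below are illustrative assumptions, not the
# authors' actual rule inventory.
import random

CONFUSION_SETS = [
    {"का", "की", "के"},            # genitive postposition: gender/number agreement
    {"लड़का", "लड़के", "लड़कों"},    # "boy" inflected for case and number
]

def inject_inflectional_error(tokens, p=0.5, rng=random):
    """Return a corrupted copy of `tokens`: each token found in a
    confusion set is replaced by a different variant with probability `p`."""
    corrupted = []
    for tok in tokens:
        replaced = False
        for conf in CONFUSION_SETS:
            if tok in conf and rng.random() < p:
                # Pick any variant other than the original token.
                corrupted.append(rng.choice(sorted(conf - {tok})))
                replaced = True
                break
        if not replaced:
            corrupted.append(tok)
    return corrupted

# Pairing the corrupted output with the original yields one synthetic edit.
original = "लड़के का खिलौना".split()
noisy = inject_inflectional_error(original, p=1.0)
print(noisy, "->", original)
```

Because the rules operate only on surface-form confusion sets, adapting the process to another Indic language amounts to supplying that language's inflectional paradigms, which is what makes the approach generic.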
2 Related Work

The most common GEC datasets come from correction-annotated language learner essays. The English learner corpora include those from shared