Contextual Spellchecking Based on N-grams

Ivan Srdić
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, Zagreb, Croatia
ivan.srdic@fer.hr

Gordan Gledec
University of Zagreb, Faculty of Electrical Engineering and Computing
Unska 3, Zagreb, Croatia
gordan.gledec@fer.hr

Abstract. The Croatian Academic Spellchecker is an online web service that has been used for almost 20 years by thousands of users every day. In recent years, the service has offered rudimentary contextual spellchecking based on pattern matching. In this paper we describe how n-gram based contextual spellchecking can be performed on texts written in Croatian, despite the orthographic complexity of the Croatian language. Separating the system into several components made it simple to upgrade the existing implementation. Using a well-known classifier, tweaking the frequency estimator and separating errors into confusion sets resulted in a contextual spellchecking system with a high score of F1 = 0.95 on the examined example.

Keywords. Contextual spellchecking, statistical approach, n-grams

1 Introduction

With the rise of remote communication, there is an increased need for a system that corrects orthographic and grammatical errors. This paper demonstrates an enhancement to the contextual spellchecking system of Haschek, a Croatian online spellchecker (service available at https://ispravi.me/) designed by Šandor Dembitz and described in Dembitz et al. (2011).

Contextual spelling errors, being the most complicated type of errors, have always been difficult to detect and correct. A spelling error in an intended word may produce a different but valid word (a real-word error); a traditional spellchecker, which only checks whether each word exists, will not detect such a change (sljedeći vs. slijedeći, zahtijeva vs. zahtjeva, etc.). For both detection and correction of contextual errors, a statistical language model is needed.
For every word that is suspected to be an error, a word with a higher probability of occurrence in the given context must be chosen. In principle, the substitute can be any word in the language, which makes an exhaustive search computationally infeasible for a smaller system.

The remainder of this paper is organized as follows: Chapter 2 describes the theoretical background behind our research; Chapter 3 explains the data sets used (n-grams collected over more than 20 years of use of the Croatian spellchecker) and describes the system architecture. Chapter 4 explains the results and Chapter 5 outlines future research. The final chapter concludes the paper.

2 Theoretical background

In this section, we describe the theoretical background behind our proposed solution to the problem of contextual spellchecking for the Croatian language.

2.1 Confusion sets

To solve the problem of detecting and correcting spelling errors, this paper proposes a solution modeled on Kim et al. (2013). A confusion set is a set of words that are likely to be substituted for one another, due to either a typographical error or a lack of knowledge about the language. Examples of confusion sets are {zahtjeva, zahtijeva} and {sljedeći, slijedeći}.

Confusion sets can be generated manually or programmatically by using Levenshtein distance. Levenshtein distance measures how different two strings are, as the minimum number of single-letter edits needed to turn one into the other (Martin, Jurafsky, 2000). Four kinds of edit operations are considered here: insertion of a letter, deletion of a letter, substitution of a letter, or transposition of two letters. (Strictly speaking, the variant that counts transposition as a single operation is usually called Damerau-Levenshtein distance.)

While an edit distance higher than 1 could be used, the most common errors in Croatian are within an edit distance of 1. Using an edit distance of 2 or more would increase the number of words in each confusion set and thus degrade the performance of the classifier. In addition, the probability of making two errors in one word is very low.
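The programmatic generation of confusion sets at edit distance 1 can be illustrated with a minimal Python sketch. It is not the implementation used in Haschek: the function names `within_one_edit` and `build_confusion_sets`, the tiny lexicon, and the quadratic pairwise comparison are assumptions made for illustration only, and transposition is assumed to mean swapping two adjacent letters.

```python
from itertools import combinations


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion,
    substitution, or transposition of two adjacent letters."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Skip the common prefix; the single edit must start right after it.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    ra, rb = a[i:], b[i:]
    if len(ra) == len(rb):
        # Substitution: everything after the first differing letter matches.
        if ra[1:] == rb[1:]:
            return True
        # Transposition: the first two letters are swapped, the rest matches.
        return (len(ra) >= 2 and ra[0] == rb[1] and ra[1] == rb[0]
                and ra[2:] == rb[2:])
    # Insertion or deletion: dropping one letter from the longer
    # remainder must give the shorter remainder.
    longer, shorter = (ra, rb) if len(ra) > len(rb) else (rb, ra)
    return longer[1:] == shorter


def build_confusion_sets(lexicon):
    """Map each word to the lexicon words within edit distance 1.

    Hypothetical helper: compares all pairs, so it only suits
    small illustrative lexicons.
    """
    neighbours = {}
    for a, b in combinations(sorted(set(lexicon)), 2):
        if within_one_edit(a, b):
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    return neighbours


# Illustrative lexicon using the word pairs mentioned in the text.
lexicon = ["zahtjeva", "zahtijeva", "sljedeći", "slijedeći", "kuća"]
confusion = build_confusion_sets(lexicon)
```

In this sketch, `confusion["zahtjeva"]` contains `zahtijeva` and `confusion["sljedeći"]` contains `slijedeći`, while `kuća` has no neighbour and is left out. For a full lexicon, generating all single-edit variants of each word and looking them up in a hash set would likely scale better than comparing every pair.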
2.2 Classifier

The classifier used in this paper is based on the well-known Naive Bayes classifier.

Proceedings of the Central European Conference on Information and Intelligent Systems
28th CECIIS, September 27-29, 2017, Varaždin, Croatia