978-1-5090-5627-9/17/$31.00 ©2017 IEEE

Statistical Parsing of Bangla Sentences by CYK Algorithm

Ayesha Khatun
Dept. of Computer Science & Engineering
Chittagong University of Engineering & Technology (CUET), Chittagong, Bangladesh
ayeshankhatun@gmail.com

Mohammed Moshiul Hoque
Dept. of Computer Science & Engineering
Chittagong University of Engineering & Technology (CUET), Chittagong, Bangladesh
moshiulh@yahoo.cm

Abstract—Statistical parsing is the task of finding the most probable parse of a sentence according to a probabilistic context-free grammar. A crucial use of a statistical parser is to resolve ambiguity. This paper proposes a statistical parser that uses the probabilistic version of the Cocke-Younger-Kasami (CYK) algorithm to parse different kinds of Bangla sentences. To improve parsing efficiency, the model also applies the left binarization technique to the grammar. Rule probabilities and word probabilities are used to generate different probabilities for the same structure of a sentence. Experimental results with different kinds of sentences show the effectiveness of the proposed parser with reasonable accuracy.

Keywords— Statistical parsing, probabilistic context-free grammar, rule generator, Chomsky normal form, binarization.

I. INTRODUCTION

Natural language sentences are ambiguous by nature, and a sentence can have multiple parses. A statistical model provides a systematic framework that assigns a score to each parse tree and chooses the tree with the highest score. The score is defined in terms of a probability value. Syntactic ambiguity is a crucial problem for parsing: it is very difficult to manually define a grammar whose rules select exactly one parse from an exponential number of possible parses, and probabilistic models provide a well-established method for selecting among the alternatives [1]. The concept of the statistical parser is related to learning probabilistic rules from a text corpus.
The probability of a parse tree is calculated by multiplying the probabilities of all the words and grammatical rules used to create that parse tree. The statistical parser relies on a dynamic programming technique that makes it possible to find the most probable parse tree of a sentence. The proposed model uses the probabilistic version of the CYK algorithm [2] for statistical parsing of sentences.

Bangla language processing is a very difficult task because of the variation among sentences and the many ambiguities that occur while parsing a sentence. To produce accurate parse structures for Bangla sentences, statistical parsing would be the best choice. The most crucial use of the statistical parser is in Bangla machine translation systems, where it helps to translate Bangla into other languages more accurately. It will also play an important role in language modeling for Bangla.

Parsing can be classified into three main categories: rule-based, statistical, and generalized parsers. In rule-based parsing, the production rules are applied recursively, and as a result many ambiguities may arise. Writing complex grammar rules to detect or resolve these ambiguities is very difficult. Statistical parsing resolves ambiguity with the help of experience, that is, a training corpus. Traditional parsing methods are used to find a correct parse tree, whereas a statistical parser helps to find the best parse tree using statistical information.

Bangla has a rich historical and cultural background. To preserve Bangla history, culture, and literature and to introduce them globally, we have to digitize the Bangla language. A statistical model can play a vital role for this purpose. Bangla is the fourth largest language of the world, yet in spite of its more than 245 million native speakers, there is still a negligible amount of work on statistical natural language processing for Bangla.
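The computation described earlier in this Introduction (a tree probability formed as the product of rule and word probabilities, maximized by probabilistic CYK) can be illustrated with a minimal Python sketch. The toy grammar, its probabilities, the transliterated words ami, bhat, and khai, and the function name cyk_best_parse are all illustrative assumptions; they are not the grammar, corpus estimates, or implementation used in this paper.

```python
from collections import defaultdict

# Hypothetical toy PCFG, already in Chomsky normal form.
# All rules and probabilities are assumptions made for illustration.
BINARY_RULES = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "NP", "V"): 0.7,
    ("VP", "VP", "ADV"): 0.3,
}
# Word (lexical) probabilities: P(word | tag).
LEXICAL_RULES = {
    ("NP", "ami"): 0.6,
    ("NP", "bhat"): 0.4,
    ("V", "khai"): 1.0,
}


def cyk_best_parse(words, start="S"):
    """Return (probability, tree) of the most probable parse, or (0.0, None)."""
    n = len(words)
    # best[(i, j)][A] = (probability, backpointer) for the span words[i:j].
    best = defaultdict(dict)

    # Fill the diagonal with word probabilities.
    for i, w in enumerate(words):
        for (tag, word), p in LEXICAL_RULES.items():
            if word == w:
                best[(i, i + 1)][tag] = (p, w)

    # Combine shorter spans into longer ones, keeping the best score per label.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY_RULES.items():
                    if b in best[(i, k)] and c in best[(k, j)]:
                        prob = p * best[(i, k)][b][0] * best[(k, j)][c][0]
                        if prob > best[(i, j)].get(a, (0.0, None))[0]:
                            best[(i, j)][a] = (prob, (k, b, c))

    if start not in best[(0, n)]:
        return 0.0, None

    def build(i, j, a):
        # Follow backpointers to rebuild the tree as nested tuples.
        _, back = best[(i, j)][a]
        if isinstance(back, str):
            return (a, back)
        k, b, c = back
        return (a, build(i, k, b), build(k, j, c))

    return best[(0, n)][start][0], build(0, n, start)


if __name__ == "__main__":
    prob, tree = cyk_best_parse(["ami", "bhat", "khai"])
    print(prob, tree)
```

For the three-word input, the sketch multiplies the two rule probabilities (1.0 and 0.7) by the three word probabilities (0.6, 0.4, and 1.0), giving a tree probability of approximately 0.168. When several derivations cover the same span with the same label, only the highest-scoring one is kept, which is how a probabilistic CYK parser chooses among ambiguous parses.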
In this paper, a statistical approach is proposed to parse different kinds of Bangla sentences. To achieve this goal, a stochastic, or probabilistic, context-free grammar is introduced. Because we use the CYK algorithm, which is based on dynamic programming, ambiguity can be detected easily. To increase parsing efficiency, the model uses the left binarization technique. To produce more accurate structures, the model considers word probabilities as well as rule probabilities. The proposed system is evaluated on a wide range of sentences of varying length, and the results show that this probabilistic parser can parse Bangla sentences successfully.

II. RELATED WORK

Much work has been done on the statistical parsing of English sentences. Sampson proposed the first stochastic, or probabilistic, approach to parsing sentences [3]. A series of five statistical models of translation is described in [4]. Michael Collins developed a statistical parser [5] that plays a tremendously powerful role in NLP. A statistical parsing system based on the Penn Wall Street Journal Treebank is proposed in [6]. Parsing based on statistical decisions for complex grammars is shown in [7].

Some work has been performed on parsing methodologies for Bangla. A parsing methodology that uses top-down and bottom-up approaches with phrase structure rules to parse simple Bangla sentences is presented in [8]. While a rule based
International Conference on Electrical, Computer and Communication Engineering (ECCE), February 16-18, 2017, Cox’s Bazar, Bangladesh