978-1-5090-5627-9/17/$31.00 ©2017 IEEE
Statistical Parsing of Bangla Sentences by CYK
Algorithm
Ayesha Khatun
Dept. of Computer Science & Engineering
Chittagong University of Engineering & Technology
(CUET), Chittagong, Bangladesh
ayeshankhatun@gmail.com
Mohammed Moshiul Hoque
Dept. of Computer Science & Engineering
Chittagong University of Engineering & Technology
(CUET), Chittagong, Bangladesh
moshiulh@yahoo.cm
Abstract—Statistical parsing is the task of enabling the parser
to find the most probable parse of a sentence according to a
probabilistic context-free grammar. A crucial use of a statistical
parser is to solve the disambiguation problem. This paper proposes
a statistical parser that uses a probabilistic version of the
Cocke-Younger-Kasami (CYK) algorithm to parse different kinds of
Bangla sentences. To improve parsing efficiency, the model also
applies left binarization to the grammar. Rule probabilities and word
probabilities are used to generate different probabilities for the same
structure of a sentence. Experimental results with different kinds of
sentences show the effectiveness of the proposed parser, with
reasonable accuracy.
Keywords— Statistical parsing, probabilistic context-free
grammar, rule generator, Chomsky normal form, binarization.
I. INTRODUCTION
Natural language sentences are ambiguous by nature, and a
sentence can have multiple parses. A statistical model is a
systematic framework that assigns a score to each parse tree
and chooses the one with the highest score. The score is defined
in terms of a probability value. Syntactic ambiguity is a
crucial problem for parsing: it is very difficult to manually define
a grammar whose rules select only one parse from an
exponential number of possible parses, and probabilistic models
provide a well-established method for selecting between the
alternatives [1]. The concept of the statistical parser is related to
learning probabilistic rules from a text corpus. The probability
of a parse tree is calculated by multiplying the probabilities of
all the words and of all the grammatical rules used to create
that parse tree. The statistical parser uses a dynamic
programming technique that makes it possible to find the most
probable parse tree of a sentence. The proposed model uses the
probabilistic version of the CYK algorithm [2] for statistical
parsing of sentences. Bangla language processing is a very
difficult task because of the variation in sentences and the many
ambiguities that occur while parsing a sentence. To
produce an accurate parse structure of Bangla sentences, statistical
parsing is a suitable choice. The most crucial use of the
statistical parser is in Bangla machine translation systems. A
statistical parser helps to translate Bangla into other languages
more accurately. It will also play an important role in
language modeling for Bangla.
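As an illustration of the probabilistic CYK idea described above, the following minimal Python sketch fills a chart bottom-up and keeps, for every span and nonterminal, the highest-probability derivation, where the probability of a derivation is the product of its rule and word probabilities. The grammar, the probabilities, the romanized example words, and the function names (`cyk_best_parse`, `build`) are assumptions made for this example only; they are not the grammar or implementation used in this paper.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (illustrative only).
# Binary rules: (parent, left child, right child) -> rule probability
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "NP", "V"): 1.0,  # verb-final order, as in Bangla
}
# Lexical rules: (tag, word) -> word probability
LEXICAL = {
    ("NP", "ami"): 0.5,
    ("NP", "bhat"): 0.5,
    ("V", "khai"): 1.0,
}


def cyk_best_parse(words, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(words)
    # best[(i, j)][A] = (probability, backpointer) for the span words[i:j]
    best = defaultdict(dict)

    # Spans of length 1: apply the lexical rules.
    for i, w in enumerate(words):
        for (tag, word), p in LEXICAL.items():
            if word == w:
                best[(i, i + 1)][tag] = (p, w)

    # Longer spans: try every split point and every binary rule.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        prob = p * best[(i, k)][left][0] * best[(k, j)][right][0]
                        # Keep only the most probable derivation per symbol.
                        if prob > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (prob, (k, left, right))

    if start not in best[(0, n)]:
        return None
    return best[(0, n)][start][0], build(best, 0, n, start)


def build(best, i, j, sym):
    """Follow backpointers to rebuild the tree as nested tuples."""
    _, back = best[(i, j)][sym]
    if isinstance(back, str):  # lexical entry: backpointer is the word
        return (sym, back)
    k, left, right = back
    return (sym, build(best, i, k, left), build(best, k, j, right))
```

Calling `cyk_best_parse(["ami", "bhat", "khai"])` on this toy grammar yields a single parse whose probability is the product 1.0 × 0.5 × 1.0 × 0.5 × 1.0. With an ambiguous grammar, the same comparison step is what selects one parse among the alternatives.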
Parsing can be mainly classified into three categories:
rule-based, statistical, and generalized parsers. In rule-based
parsing the production rules are applied recursively, and as a
result many ambiguities may arise. Writing complex grammar rules
to detect or resolve these ambiguities is a very difficult task.
Statistical parsing resolves ambiguity with the help of experience,
that is, a training corpus. Traditional parsing methods are used
to find a correct parse tree, whereas a statistical parser helps to
find the best parse tree using statistical information.
Bangla has a rich historical and cultural background. To keep
Bangla history, culture, and literature alive and to introduce them
globally, we have to digitize the Bangla language. A statistical
model can play a vital role for this purpose. Bangla is the fourth
largest language of the world, and in spite of having over 245
million native speakers, there is still only a negligible amount of
work on statistical natural language processing for Bangla.
In this paper, a statistical approach is proposed to parse
different kinds of Bangla sentences. To achieve this goal, a
stochastic or probabilistic context-free grammar is introduced.
Because we use the dynamic-programming CYK algorithm, ambiguity
can be detected easily. To increase parsing efficiency, the model
uses a left binarization technique. To obtain a more accurate parse
structure, the model considers word probabilities as well as rule
probabilities. The proposed system is evaluated on a wide range
of sentences of varying lengths, and the results show that this
probabilistic parser can parse Bangla sentences successfully.
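The binarization step mentioned above can be sketched as follows. CYK needs every rule to have at most two symbols on its right-hand side, so longer rules are split by introducing intermediate symbols. This sketch assumes that "left binarization" means grouping the two leftmost symbols first (conventions differ between authors). The function name `left_binarize`, the `<X+Y>` naming of intermediate symbols, and the example rules are hypothetical and are not taken from the paper.

```python
def left_binarize(rules):
    """Binarize PCFG rules by repeatedly grouping the two leftmost symbols.

    rules: dict mapping (lhs, rhs_tuple) -> probability.
    Returns a new dict in which every right-hand side has at most 2 symbols.
    """
    out = {}
    for (lhs, rhs), p in rules.items():
        rhs = tuple(rhs)
        while len(rhs) > 2:
            # Introduce an intermediate symbol for the leftmost pair.
            new_sym = "<" + "+".join(rhs[:2]) + ">"
            # The intermediate rule gets probability 1.0 so that the
            # product over the binarized rules equals the original p.
            out[(new_sym, rhs[:2])] = 1.0
            rhs = (new_sym,) + rhs[2:]
        out[(lhs, rhs)] = p
    return out


# Hypothetical example: one ternary rule and one rule that is already binary.
example_rules = {
    ("VP", ("NP", "NP", "V")): 0.4,
    ("S", ("NP", "VP")): 1.0,
}
binarized = left_binarize(example_rules)
```

Here the ternary rule VP → NP NP V becomes `<NP+NP>` → NP NP and VP → `<NP+NP>` V, while the binary rule is left unchanged. Because the intermediate rule carries probability 1.0, the probability of any parse tree is the same before and after binarization.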
II. RELATED WORK
Much work has been done on the statistical parsing of English
sentences. Sampson proposed the first stochastic or probabilistic
approach to parsing sentences [3]. A series of five statistical
models of translation is described in [4]. Michael Collins
developed a statistical parser [5] that has been highly
influential in NLP. A statistical parsing system based on the Penn
Wall Street Journal Treebank is proposed in [6]. Parsing based on
statistical decisions for complex grammars is shown in [7].
Some work has been performed on parsing methodologies for
Bangla. A parsing methodology using top-down and bottom-up
approaches with phrase structure rules to parse simple Bangla
sentences is presented in [8]. While a rule based
International Conference on Electrical, Computer and Communication Engineering (ECCE), February 16-18, 2017, Cox’s Bazar, Bangladesh