Supervised Urdu Word Segmentation Model Based on
POS Information
Sadiq Nawaz Khan¹,*, Khairullah Khan¹ and Wahab Khan²
¹ Department of Computer Science, University of Science & Technology Bannu, Pakistan
² Department of Computer Science & Software Engineering, IIU, Islamabad 44000, Pakistan
Abstract
Urdu is the national language of Pakistan and one of the most widely spoken and understood languages in the world.
Successful Urdu NLP requires robust, high-performance tools and resources. Word segmentation plays a central role
for morphologically rich languages such as Urdu across diverse NLP tasks, including named entity recognition,
sentiment analysis, part-of-speech tagging and information retrieval. The morphological richness of Urdu adds to
the challenge of word segmentation, because a single word can be composed of zero or a few prefixes, a stem, and
zero or a few suffixes. In this paper we present a supervised Urdu word segmentation scheme based on the
part-of-speech (POS) information of the corresponding words. For the experiments, conditional random fields (CRF)
with contextual features are used. The performance of the proposed system is evaluated on 300K words, and the
results show a clear improvement over the baseline approach.
Keywords: Urdu, Word segmentation, supervised learning, conditional random fields
Received on 10 May 2018, accepted on 04 September 2018, published on 10 September 2018
Copyright © 2018 Sadiq Nawaz Khan et al., licensed to EAI. This is an open access article distributed under the terms of
the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use,
distribution and reproduction in any medium so long as the original work is properly cited.
doi: 10.4108/eai.19-6-2018.155444
* Corresponding author. Email: Sadiqnawaz97@gmail.com
1. Introduction
Natural Language Processing (NLP) now plays a vital role
in many fields of computer science. Researchers are
trying to simulate human knowledge with computer systems.
To this end, NLP researchers work to encode the
knowledge through which computers understand and use
natural language. Various advanced tools and procedures
are applied to give computer systems more cognitive
ability. NLP draws on several disciplines, including
electronic and electrical engineering, linguistics,
information and computer sciences, mathematics,
psychology and artificial intelligence (AI) [1]. NLP
applications are widely used and span areas such as
word segmentation, speech recognition, text processing
and summarization, cross-language information retrieval
(CLIR), user interfaces, voice recognition and
artificial intelligence.
Information retrieval (IR) finds the desired information
in a large collection of data, while information
extraction (IE) processes one or more documents to
identify pre-specified entities or events. Artificial
intelligence is a sub-field of computer science that
studies the development of hardware and software that
simulate human intelligence.
Word segmentation plays a vital role in NLP
applications. It separates written or spoken text into
meaningful word tokens and identifies word boundaries
in a spoken language. In
Research Article
EAI Endorsed Transactions
on Scalable Information Systems