International Journal of Advanced and Applied Sciences, 3(9) 2016, Pages: 59-66
International Journal of Advanced and Applied Sciences
Journal homepage: http://www.science-gate.com/IJAAS.html
Artificial intelligence and natural language processing: the Arabic corpora in online
translation software
Mohammed Abdulmalik Ali *
Department of English, Prince Sattam Bin Abdulaziz University, AlKharj, Saudi Arabia
ARTICLE INFO
Article history:
Received 25 May 2016
Received in revised form 28 August 2016
Accepted 20 September 2016
ABSTRACT
Ironically, only about 1% of worldwide Internet content is in the Arabic
language, although 5% of the world population speaks Arabic. This points
to a disproportionately small presence of online Arabic content compared
with other languages, which may be due to many reasons, including a lack
of experts in the field of the Arabic language. This research study will
investigate the impact of the Machine Translation (MT) software and TM
tools that are widely used by the Arab community for academic and
business purposes. The study aims to find out whether it is possible to
bring about a paradigm shift from Arabic Localization to Arabic
Globalization, thereby facilitating the use of NLP techniques in the
human interface with the computer. For this study, a few machine
translation systems (e.g. SYSTRAN, IBM Watson) shall be examined for
their content and applications, to determine whether they can be used
without human intervention while retaining the meaning of the original
text.
Keywords:
Arabic corpora
Online content
Translation
Software
© 2016 The Authors. Published by IASE. This is an open access article under the CC
BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
* Corresponding Author.
Email Address: mh.ali@psau.edu.sa (M. A. Ali)
https://doi.org/10.21833/ijaas.2016.09.010
2313-626X/© 2016 The Authors. Published by IASE.
1. Introduction
Researchers define Natural Language Processing (NLP) as the branch of
Artificial Intelligence (AI) that deals with analyzing the language a
human being uses to interface with a computer. A great challenge in such
an interface is to teach a computer a language that humans can learn,
understand and interact in, which, in the current context, is Arabic.
Arabic is the largest living Semitic language, an official language of 23
countries, and is spoken by over 360 million people worldwide (the
population of the Arab world, which stretches from Morocco in North
Africa to Dubai in the Persian Gulf, was estimated at 369.8 million in
2013). Ironically, Arabic accounts for less than 1% of worldwide Internet
content, although 5% of the world population speaks it. This points to a
disproportionately small presence of online Arabic content compared with
other languages. The reason given by NLP experts (Ali and Khaled, 2009;
Habash, 2010; Hijjawi and Elsheikh, 2015; Huang, 2015) relates to Arabic
diglossia: Arabic has two forms existing concurrently. The first is Modern
Standard Arabic (MSA), which is widely used in formal situations such as
formal speeches, government and official operations, product manuals, and
news media, and is perceived as the "language of the mind". The second
form, known as Dialectal Arabic (DA), is the informal private language,
predominantly found as spoken vernaculars with no written standards, and
perceived as the "language of the heart", although Arab speakers perceive
the use of dialects as a "deteriorated" form of Classical Arabic (Huang,
2015), a much-debated issue that is outside the scope of this study.
This research paper will cite a few studies that have principally
addressed these challenges and shall also investigate the impact of the
Machine Translation (MT) software that is widely used by the Arab
community for academic and business purposes. This study will also cite
examples of discrepancies found in such software and in search engines.
The main objective of this study is to find out whether it is possible to
bring about a paradigm shift from Arabic Localization to Arabic
Globalization in order to facilitate the use of NLP techniques, or even to
formulate and modify the existing Arabic corpora for a better
understanding of the language. The article shall also discuss frequent
colloquialisms (e.g. the Arab chat alphabet known as Moaarab or Arabizi)
as found on social media platforms like Facebook and Twitter,
notwithstanding spelling errors,