Arti Chugh et al. International Journal of Recent Research Aspects ISSN: 2349-7688, Vol. 3,
Issue 2, June 2016, pp. 25-27
© 2016 IJRRA All Rights Reserved page - 25-
Big Data for Improvement in Translation
Process of Webpages
Aarti Chugh
Department of Computer Science and Engineering, Amity University Haryana
Abstract— With extensive improvements in network infrastructure, information now flows across national boundaries, creating a need to handle language barriers. Many algorithms have been designed to translate the information on a webpage into different languages. This paper lays out a concept for improving the translation of available webpage information into multiple languages through the application of data mining and big-data technology.
Keywords—Big Data; Web Search Engine; Crawlers; Searching; Machine Translation; Phrase-Based Statistical Machine Translation;
Noisy Channel Model; Bilingual Evaluation Understudy; Finite Automata
I. INTRODUCTION
This paper emphasizes the efficient use of parallel text and big
data technology to translate available web information into
multiple languages. For example, translation from English to
Hindi sometimes produces incorrect output, which may affect
official government work. The need for effective translation can
be met by scanning the data and applying smart algorithms.
These algorithms must account for the volume, velocity, and
variety of Big Data.
Traditionally, a search engine analyses patterns in large
amounts of text. These patterns are then processed with the help
of three techniques: 1. web crawling, 2. indexing, and
3. searching. For efficient translation, however, the Bilingual
Evaluation Understudy (BLEU) metric is applied through a
fuzzy approach, where values other than 0 and 1 can be
considered. This can be implemented by applying the BLEU
algorithm to a set of English sentences E converted into a set of
Hindi sentences H, which yields all the possible correct
translations from the data sources.
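The fuzzy scoring idea above can be illustrated with a minimal, self-contained sketch of sentence-level BLEU (modified n-gram precision with a brevity penalty); the example sentences and the choice of max_n = 2 are assumptions for illustration, not from the paper.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions, scaled by a brevity penalty. Returns a fuzzy
    score in [0, 1] rather than a binary correct/incorrect."""
    cand = candidate.split()
    ref = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidate translations.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the cat sits on the mat", "the cat sat on the mat")
```

A near-miss translation thus receives a score strictly between 0 and 1, which is exactly the fuzzy behaviour the approach relies on.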
II. METHODOLOGY
A. Scanning Data
The first step is data pruning using the Noisy
Channel Method. Traditionally, strings are formed over an
alphabet Σ, and any string that does not belong to Σ* (the set of
all valid strings over Σ) is rejected. Translation produces many
permutations and combinations of strings; the incorrectly
translated text can be pruned using the noisy channel method.
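The pruning step can be sketched as follows: under the noisy channel model, each hypothesis h for a source sentence e is scored by P(e | h) · P(h), and low-scoring hypotheses are discarded. The hypothesis identifiers, probabilities, and threshold below are toy values assumed for illustration; in practice they would be estimated from parallel corpora.

```python
def noisy_channel_rank(candidates, threshold=0.0):
    """Rank candidate translations h of a source sentence e by the
    noisy-channel score P(e|h) * P(h); prune candidates whose
    score falls below the threshold.
    `candidates` maps each hypothesis to (p_e_given_h, p_h)."""
    scored = {h: p_eh * p_h for h, (p_eh, p_h) in candidates.items()}
    kept = {h: s for h, s in scored.items() if s >= threshold}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: three hypothetical Hindi hypotheses for one English sentence.
ranked = noisy_channel_rank({
    "H1": (0.6, 0.5),   # faithful and fluent
    "H2": (0.7, 0.1),   # faithful but disfluent
    "H3": (0.05, 0.9),  # fluent but wrong meaning
}, threshold=0.05)
```

The fluent-but-wrong hypothesis is pruned even though its language-model score is high, because the channel probability P(e | h) penalises the mistranslation.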
B. Knowledge Discovery using Big Data
Since data is increasing exponentially, the complexity of
searching for patterns also increases, which makes algorithm
design more difficult. There are, however, examples such as
LinkedIn, where Big Data technology is creatively implemented
to generate the recommendations shown on the right side of its
web pages. Similarly, a semantic rule can be designed by
learning a standardized decision tree. Decision tree induction
provides a platform for translating between multiple languages.
The data preparation step is crucial, as it forms the basis of the
translation. The grammar of each language can be encoded as
an algorithm, and this set of rules helps to clean and prepare the
data for translation. For example, applying the algorithm to a
set of English data E = {E1, E2, E3, …, En} converts it into
another set of Hindi data H = {H1, H2, H3, …, Hn}.
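The mapping from the set E to the set H can be sketched as a pipeline of cleanup rules followed by rule-based word substitution. The cleanup rules and the three-entry English–Hindi lexicon below are hypothetical placeholders for the grammar-derived rule set the paper describes.

```python
import re

# Toy cleanup rules (assumptions for illustration): strip punctuation
# and collapse whitespace before translation.
CLEAN_RULES = [
    (re.compile(r"[^\w\s]"), ""),   # drop punctuation
    (re.compile(r"\s+"), " "),      # collapse runs of whitespace
]

# Hypothetical bilingual lexicon fragment (English -> Hindi).
LEXICON = {"water": "पानी", "book": "किताब", "house": "घर"}

def prepare(sentence):
    """Apply the cleanup rules: the data-preparation step."""
    s = sentence.lower().strip()
    for pattern, repl in CLEAN_RULES:
        s = pattern.sub(repl, s)
    return s.strip()

def translate_set(E):
    """Map English data E = {E1..En} to Hindi data H = {H1..Hn},
    translating lexicon words and passing the rest through."""
    H = []
    for e in E:
        words = prepare(e).split()
        H.append(" ".join(LEXICON.get(w, w) for w in words))
    return H

H = translate_set(["Water!", "  The Book "])
```

Words outside the lexicon are passed through unchanged, making the preparation step's coverage of the grammar the limiting factor, as the text argues.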
C. Web Search Engine
The mechanisms by which web search engines work should be
modified for translation. The key change is crawling speech in
addition to text, using Big Data technologies such as Hadoop,
so that an improved meaning can be derived for particular
strings of data. The next step in knowledge discovery is
indexing, where modelling should be done by analyzing only
the content that is meaningful.
This content should be saved automatically for future reference.
Then the best-matching keywords, i.e., the set of correct strings,
should be indexed and searched simultaneously so that the
desired outputs can be displayed. These outputs should closely
match what human intelligence would produce.
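The index-only-meaningful-content step can be sketched with a small inverted index; the stop-word list below is an assumed stand-in for the paper's notion of "meaningful" content, and the page texts are invented examples.

```python
from collections import defaultdict

# Tiny stop list: words treated as non-meaningful for indexing
# (an assumption standing in for the paper's meaningfulness filter).
STOPWORDS = {"the", "a", "an", "is", "of"}

def build_index(pages):
    """Build an inverted index from {page_id: text}, keeping only
    meaningful (non-stopword) terms."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            if term not in STOPWORDS:
                index[term].add(page_id)
    return index

def search(index, query):
    """Return the pages that contain every meaningful query term."""
    terms = [t for t in query.lower().split() if t not in STOPWORDS]
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for t in terms[1:]:
        result &= index.get(t, set())
    return result

idx = build_index({
    "p1": "the translation of a webpage",
    "p2": "webpage indexing is fast",
})
hits = search(idx, "webpage translation")
```

Because stop words never enter the index, the stored structure stays compact, and lookups touch only the meaningful strings, as the text prescribes.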
D. Association Rule
Crawlers are normally designed to search content through a
spider-web approach. This approach can be boosted by
computing the support and confidence of the translated text.
Doing so adds a measure of human intelligence, since it follows
what the public or language interpreters would predict. The
support and confidence should be computed on the pruned data.
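Support and confidence over the pruned data can be computed as in standard association-rule mining; the "transactions" below are hypothetical sets of aligned phrase pairs seen together in the parallel text.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) over the transactions."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# Toy pruned transactions: phrase pairs observed together
# (the identifiers are hypothetical).
T = [
    {"water", "पानी"},
    {"water", "पानी", "cold"},
    {"water", "जल"},
    {"book", "किताब"},
]
s = support(T, {"water", "पानी"})       # seen in 2 of 4 transactions
c = confidence(T, {"water"}, {"पानी"})  # 2 of the 3 "water" transactions
```

A high-confidence pair such as water → पानी mirrors what interpreters would predict, which is the sense in which the rule "adds human intelligence" to the crawler.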
E. Models
To compute correct support and confidence, a proper data set
with minimum cost is required. For this, the fact constellation
data warehouse model should be used, since it can share the
common semantic rules of the two languages.
Once the fact constellation is built, meta-data can be derived
from it. This meta-data shows the transformations and the
relationship history of the grammars of the two languages. Once
the meta-data captures the grammar, the best possible
algorithms can be designed for effective translation.
Flexibility should be allowed in the design of the fact
constellation, since the data increases exponentially. This
enables a correct meta-data repository to be built from the huge
amount of data.
The private files of web pages should not be crawled. It is
recommended to use robots.txt to protect such data from
indexing.
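The robots.txt recommendation can be honoured in a crawler with Python's standard-library parser; the file content and URLs below are invented examples of keeping a private path out of the crawl.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical paths) that keeps private
# pages out of crawls while leaving the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
ok_public = rp.can_fetch("*", "http://example.com/index.html")
ok_private = rp.can_fetch("*", "http://example.com/private/data.html")
```

A crawler that consults can_fetch() before every request will index the public pages and skip the private files, as recommended.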
F. Mathematical Interpretation