Arti Chugh et al. International Journal of Recent Research Aspects ISSN: 2349-7688, Vol. 3,
Issue 2, June 2016, pp. 25-27
© 2016 IJRRA All Rights Reserved page - 25-
Big Data for Improvement in Translation
Process of Webpages
Aarti Chugh
Department of Computer Science and Engineering, Amity University Haryana
Abstract— With extensive improvements in network infrastructure, information now flows across national boundaries, creating a need to handle language barriers. Many algorithms have been designed to translate the information on a webpage into different languages. This paper lays out a concept for improving the translation of available webpage information into multiple languages through the application of data mining and big-data technology.
Keywords—Big Data; Web Search Engine; Crawlers; Searching; Machine Translation; Phrase-Based Statistical Machine Translation;
Noisy Channel Model; Bilingual Evaluation Understudy; Finite Automata
I. INTRODUCTION
This paper emphasizes the efficient use of parallel text and big
data technology to translate available web information into
multiple languages. For example, translation from English to
Hindi sometimes produces incorrect output, which may affect
official government work. The need for effective translation can
be met by scanning the data and applying smart algorithms.
These algorithms must account for the volume, velocity, and
variety of Big Data.
Traditionally, a search engine analyses patterns in large
amounts of text. These patterns are then processed with the help
of three techniques: 1. web crawling, 2. indexing, and
3. searching. For efficient translation, however, the Bilingual
Evaluation Understudy (BLEU) metric is applied through a
fuzzy approach, where values other than 0 and 1 can be
considered. This can be implemented by applying the BLEU
algorithm to a set of English sentences E converted into a set of
Hindi sentences H, which yields all the possible correct
translations from the data sources.
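The fuzzy scoring idea above can be illustrated with a minimal, self-contained sketch of sentence-level BLEU (modified n-gram precision with a brevity penalty); the example sentences and the choice of max_n = 2 are assumptions for illustration, not from the paper.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions, scaled by a brevity penalty. Returns a fuzzy
    score in [0, 1] rather than a binary correct/incorrect."""
    cand = candidate.split()
    ref = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidate translations.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the cat sits on the mat", "the cat sat on the mat")
```

A near-miss translation thus receives a score strictly between 0 and 1, which is exactly the fuzzy behaviour the approach relies on.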
II. METHODOLOGY
A. Scanning Data
The first step is data pruning using the Noisy
Channel Method. Traditionally, strings are formed over an
alphabet Σ, and any string that does not belong to Σ* (the set of
all valid strings over Σ) is rejected. Translation produces many
permutations and combinations of strings; the incorrectly
translated text can be pruned using the noisy channel method.
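The pruning step can be sketched as follows: under the noisy channel model, each hypothesis h for a source sentence e is scored by P(e | h) · P(h), and low-scoring hypotheses are discarded. The hypothesis identifiers, probabilities, and threshold below are toy values assumed for illustration; in practice they would be estimated from parallel corpora.

```python
def noisy_channel_rank(candidates, threshold=0.0):
    """Rank candidate translations h of a source sentence e by the
    noisy-channel score P(e|h) * P(h); prune candidates whose
    score falls below the threshold.
    `candidates` maps each hypothesis to (p_e_given_h, p_h)."""
    scored = {h: p_eh * p_h for h, (p_eh, p_h) in candidates.items()}
    kept = {h: s for h, s in scored.items() if s >= threshold}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: three hypothetical Hindi hypotheses for one English sentence.
ranked = noisy_channel_rank({
    "H1": (0.6, 0.5),   # faithful and fluent
    "H2": (0.7, 0.1),   # faithful but disfluent
    "H3": (0.05, 0.9),  # fluent but wrong meaning
}, threshold=0.05)
```

The fluent-but-wrong hypothesis is pruned even though its language-model score is high, because the channel probability P(e | h) penalises the mistranslation.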
B. Knowledge Discovery using Big Data
Since data is increasing exponentially, the complexity of
searching for patterns also increases, which makes algorithm
design more difficult. There are, however, examples such as
LinkedIn, where Big Data technology is creatively implemented
to generate the recommendations shown on the right side of its
web pages. Similarly, a semantic rule can be designed by
learning a standardized decision tree. Decision tree induction
provides a platform for translating between multiple languages.
The data preparation step is crucial, as it forms the basis of the
translation. The grammar of each language can be encoded as
an algorithm, and this set of rules helps to clean and prepare the
data for translation. For example, applying the algorithm to a
set of English data E = {E1, E2, E3, …, En} converts it into
another set of Hindi data H = {H1, H2, H3, …, Hn}.
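The mapping from the set E to the set H can be sketched as a pipeline of cleanup rules followed by rule-based word substitution. The cleanup rules and the three-entry English–Hindi lexicon below are hypothetical placeholders for the grammar-derived rule set the paper describes.

```python
import re

# Toy cleanup rules (assumptions for illustration): strip punctuation
# and collapse whitespace before translation.
CLEAN_RULES = [
    (re.compile(r"[^\w\s]"), ""),   # drop punctuation
    (re.compile(r"\s+"), " "),      # collapse runs of whitespace
]

# Hypothetical bilingual lexicon fragment (English -> Hindi).
LEXICON = {"water": "पानी", "book": "किताब", "house": "घर"}

def prepare(sentence):
    """Apply the cleanup rules: the data-preparation step."""
    s = sentence.lower().strip()
    for pattern, repl in CLEAN_RULES:
        s = pattern.sub(repl, s)
    return s.strip()

def translate_set(E):
    """Map English data E = {E1..En} to Hindi data H = {H1..Hn},
    translating lexicon words and passing the rest through."""
    H = []
    for e in E:
        words = prepare(e).split()
        H.append(" ".join(LEXICON.get(w, w) for w in words))
    return H

H = translate_set(["Water!", "  The Book "])
```

Words outside the lexicon are passed through unchanged, making the preparation step's coverage of the grammar the limiting factor, as the text argues.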
C. Web Search Engine
The mechanisms by which web search engines work should be
modified for translation. The key change is crawling speech in
addition to text, using Big Data technologies such as Hadoop,
so that an improved meaning can be derived for particular
strings of data. The next step in knowledge discovery is
indexing, where modelling should be done by analyzing only
the content that is meaningful.
This content should be saved automatically for future reference.
Then the best-matching keywords, i.e., the set of correct strings,
should be indexed and searched simultaneously so that the
desired outputs can be displayed. These outputs should closely
match what human intelligence would produce.
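The index-only-meaningful-content step can be sketched with a small inverted index; the stop-word list below is an assumed stand-in for the paper's notion of "meaningful" content, and the page texts are invented examples.

```python
from collections import defaultdict

# Tiny stop list: words treated as non-meaningful for indexing
# (an assumption standing in for the paper's meaningfulness filter).
STOPWORDS = {"the", "a", "an", "is", "of"}

def build_index(pages):
    """Build an inverted index from {page_id: text}, keeping only
    meaningful (non-stopword) terms."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            if term not in STOPWORDS:
                index[term].add(page_id)
    return index

def search(index, query):
    """Return the pages that contain every meaningful query term."""
    terms = [t for t in query.lower().split() if t not in STOPWORDS]
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for t in terms[1:]:
        result &= index.get(t, set())
    return result

idx = build_index({
    "p1": "the translation of a webpage",
    "p2": "webpage indexing is fast",
})
hits = search(idx, "webpage translation")
```

Because stop words never enter the index, the stored structure stays compact, and lookups touch only the meaningful strings, as the text prescribes.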
D. Association Rule
Crawlers are normally designed to search content through a
spider-web approach. This approach can be boosted by
computing the support and confidence of the translated text.
Doing so adds a measure of human intelligence, since it follows
what the public or language interpreters would predict. The
support and confidence should be computed on the pruned data.
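Support and confidence over the pruned data can be computed as in standard association-rule mining; the "transactions" below are hypothetical sets of aligned phrase pairs seen together in the parallel text.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) over the transactions."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# Toy pruned transactions: phrase pairs observed together
# (the identifiers are hypothetical).
T = [
    {"water", "पानी"},
    {"water", "पानी", "cold"},
    {"water", "जल"},
    {"book", "किताब"},
]
s = support(T, {"water", "पानी"})       # seen in 2 of 4 transactions
c = confidence(T, {"water"}, {"पानी"})  # 2 of the 3 "water" transactions
```

A high-confidence pair such as water → पानी mirrors what interpreters would predict, which is the sense in which the rule "adds human intelligence" to the crawler.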
E. Models
To compute correct support and confidence, a proper data set
with minimum cost is required. For this, the fact constellation
data warehouse model should be used, since it can share the
common semantic rules of the two languages.
Once the fact constellation is built, meta-data can be derived
from it. This meta-data shows the transformations and the
relationship history of the grammars of the two languages. Once
the meta-data captures the grammar, the best possible
algorithms can be designed for effective translation.
Flexibility should be allowed in the design of the fact
constellation, since the data increases exponentially. This
enables a correct meta-data repository to be built from the huge
amount of data.
The private files of web pages should not be crawled. It is
recommended to use robots.txt to protect such data from
indexing.
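The robots.txt recommendation can be honoured in a crawler with Python's standard-library parser; the file content and URLs below are invented examples of keeping a private path out of the crawl.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical paths) that keeps private
# pages out of crawls while leaving the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
ok_public = rp.can_fetch("*", "http://example.com/index.html")
ok_private = rp.can_fetch("*", "http://example.com/private/data.html")
```

A crawler that consults can_fetch() before every request will index the public pages and skip the private files, as recommended.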
F. Mathematical Interpretation