A Distributed Framework for NLP-Based Keyword and Keyphrase Extraction From Web Pages and Documents

P. Nesi, G. Pantaleo and G. Sanesi
Distributed Systems and Internet Technology Lab, DISIT Lab, http://www.disit.dinfo.unifi.it
Department of Information Engineering (DINFO), University of Florence – Firenze, Italy
paolo.nesi@unifi.it, gianni.pantaleo@unifi.it

Abstract—The rapid growth of the World Wide Web and of the number of resources available online represents a massive source of knowledge for various research and business interests. Such knowledge is, for the most part, embedded in the textual content of web pages and documents, and is largely expressed in unstructured natural language form. For automatically ingesting and processing such huge amounts of data, single-machine, non-distributed architectures are proving inefficient at tasks like Big Data mining and intensive text processing and analysis. Current Natural Language Processing (NLP) systems are growing in complexity, and their computational requirements have increased significantly, calling for solutions such as distributed frameworks and parallel programming paradigms. This paper presents a distributed framework for executing NLP tasks in a parallel environment. This has been achieved by integrating the APIs of the widespread GATE open source NLP platform into a multi-node cluster built upon the open source Apache Hadoop file system. The proposed framework has been evaluated against a real corpus of web pages and documents.

Keywords – Natural Language Processing, Part-of-Speech Tagging, Parallel Computing, Distributed Systems.

I. INTRODUCTION

Nowadays we live in a hyper-connected digital world, dealing with computational solutions that are progressively more oriented to data-driven approaches, due to the growing population of web resources and the availability of extremely vast amounts of data [1].
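The distribution scheme summarized in the abstract (NLP jobs run over a Hadoop cluster) follows, in spirit, the classic map/reduce pattern, in which a mapper emits key/value pairs per document and a reducer aggregates them after a shuffle phase. The Python sketch below simulates that flow on a single machine purely for illustration; the function names and the token-counting task are assumptions of this sketch, not the paper's actual GATE-based job.

```python
from collections import defaultdict

def map_tokens(doc_id, text):
    """Mapper: emit a (token, 1) pair for each token in one document."""
    for token in text.lower().split():
        yield token, 1

def reduce_counts(token, counts):
    """Reducer: sum the partial counts collected for one token."""
    return token, sum(counts)

def run_job(corpus):
    """Simulate the shuffle/sort phase Hadoop performs between map and reduce."""
    shuffled = defaultdict(list)
    for doc_id, text in corpus.items():
        for token, one in map_tokens(doc_id, text):
            shuffled[token].append(one)
    return dict(reduce_counts(t, c) for t, c in shuffled.items())

corpus = {"d1": "web pages and documents", "d2": "web documents"}
print(run_job(corpus))  # token -> corpus-wide frequency
```

In an actual Hadoop deployment the mapper and reducer would run on separate cluster nodes, with the file system handling data placement; the local simulation above only mirrors the data flow.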
This has opened attractive new opportunities in several application areas, such as e-commerce, business web services, Smart City platforms, e-Healthcare, scientific research, ICT technologies and many others. Modern information societies are characterized by the handling of vast data repositories (both public and private), in which Petabyte-scale datasets are rapidly becoming the norm. For instance, in 2014 Google was estimated to process over 100 Petabytes per day on 3 million servers 1; in the first half of 2014, the Facebook warehouse was reported to store about 300 PB of Hive data, with an incoming daily rate of about 600 TB 2. These numbers reveal how the rate at which data are stored is rapidly outpacing our ability to process them automatically and efficiently. Such a huge pool of available information has attracted a lot of interest in several research and commercial scenarios, ranging from automatic comprehension of natural language text documents, supervised document classification [2], content extraction and summarization [3], the design of expert systems, recommendation tools, Question-Answering systems and query expansion [4], up to social media mining and analysis aimed at collecting and assessing users' trends and habits for target marketing and customized services. Applications and services must therefore be able to scale up to the items, domains and data subsets of interest. Approaches based on parallel and distributed architectures are also spreading to search engines and indexing systems; one example is the ElasticSearch 3 engine, an open source distributed search engine designed to be scalable, near real-time capable and to provide full-text search [5]. NLP systems are commonly executed as a pipeline in which each plugin or tool is responsible for a specific task.

1 http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014
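The pipeline organization just described, in which each stage adds information that later stages consume, can be sketched as follows. The two stages here (a whitespace tokenizer and a naive capitalization-based tagger) are toy stand-ins chosen for this sketch; they are not the actual GATE plugins used by the framework.

```python
def tokenize(doc):
    """Stage 1: add token annotations to the shared document."""
    doc["tokens"] = doc["text"].split()
    return doc

def tag_capitalized(doc):
    """Stage 2: consume the tokens produced by stage 1 and flag
    capitalized ones as candidate entities (a naive heuristic)."""
    doc["candidates"] = [t for t in doc["tokens"] if t[:1].isupper()]
    return doc

def run_pipeline(text, stages):
    """Run each stage in order; every stage sees the annotations
    produced by all previous stages."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("Apache Hadoop runs NLP jobs", [tokenize, tag_capitalized])
print(doc["candidates"])  # ['Apache', 'Hadoop', 'NLP']
```

Note that the second stage cannot run before the first: this sequential dependence between stages is precisely what makes naive parallelization of such pipelines non-trivial.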
One of the most intensive tasks is annotation, defined as the process of adding linguistic information to language data [6], usually concerning their grammatical, morphological and syntactical role within the given context. Text annotation is the basis of higher-level tasks such as the extraction of keywords and keyphrases (the identification of a set of single words or short phrases representing key segments that describe the meaning of a document) [7], up to the design of Semantic Computing frameworks (involving activities such as the supervised classification of entities, attributes and relations), which lead to the production of structured data, typically in the form of taxonomies, thesauri and ontologies. Current sequentially integrated NLP architectures, in which each pipeline step uses the intermediate results and outcomes produced by previous processing stages, are prone to problems related to information flow, from congestion to information loss [8]. It

2 https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
3 https://www.elastic.co/products/elasticsearch

DOI reference number: 10.18293/DMS2015-024
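As a toy illustration of the keyword extraction task described above — selecting single words that describe a document — the sketch below simply ranks non-stopword tokens by frequency. The stopword list and the frequency criterion are simplifying assumptions of this sketch; real extractors, including the one in [7], rely on the grammatical and morphological annotations discussed in this section rather than raw counts.

```python
from collections import Counter

# Minimal illustrative stopword list (an assumption of this sketch).
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "are", "for"}

def extract_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens as keyword candidates."""
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

text = ("Distributed NLP frameworks distribute NLP annotation tasks "
        "across a cluster; annotation is the basis of keyword extraction.")
print(extract_keywords(text, 2))  # ['nlp', 'annotation']
```

Extending the same idea from single words to short n-grams would yield naive keyphrase candidates; the quality gap between this heuristic and annotation-driven extraction is what motivates running full NLP pipelines at scale.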