Parallel programming approaches for efficient natural language processing of big data

Research paper

Salha M. Alzahrani
Department of Computer Science, College of Computers and Information Technology, Taif University. E-mail: s.zahrani@tu.edu.sa

Abstract

Natural language processing and related fields such as computational linguistics and information retrieval deal with big data collections. With the advent of recent storage technology, large data collections can be stored efficiently at low cost. Yet performance, in terms of speed and efficiency, has become problematic. In this paper, we address the challenge of natural language processing over text collections of various sizes and propose parallel programming approaches on multi-core machines. Three architectures of parallelism for natural language processing are investigated: big data-simple process, small data-complex process, and big data-complex process. Essential NLP methods, namely tokenisation, stemming, lemmatisation, POS tagging, near-duplicate detection, and document ranking, were investigated using different levels of parallelism. Experiments were run on single-core, dual-core, quad-core, 8-core, 16-core, and 32-core machines. Recent tools offered by C#.NET, Python, and MATLAB were used for efficient parallelisation, whereby data distribution and lock management are performed automatically. Datasets of sizes 1.41GB, 5.09GB, and 9.68GB were designed for this research; they fulfil researchers' needs and are sufficient to prove the efficacy of the proposed methods. Results from multi-core parallel computing were compared to sequential-computing baselines of the NLP algorithms. The speedup achieved in parallel tokenisation was positive with word unigram and word n-gram schemes but negative with sentence and character n-gram tokenisation schemes.
This can be explained by the simplicity of the sequential versions of sentence and character n-gram tokenisation, against which the parallel versions incur a fairly significant overhead to create and organise multiple threads. The speedup obtained in stemming, lemmatisation, POS tagging, near-duplicate detection, and document ranking was highly remarkable. Depending on the complexity of the NLP algorithm, the size of the dataset, and the number of cores, the parallel NLP algorithms achieved speedups varying from 2x to 10x over the sequential baselines.

Keywords: parallel computing; NLP; Natural Language Processing; accelerate processing; multi-core machines

1 Introduction

The growth of data, the growth of storage technology, and the growth of computing resources are three dimensions that must keep pace with one another. Data is growing rapidly because of the Internet and the social media available today. Storage technology is advancing to the point that one can save gigabytes to terabytes of data on cheap storage media. Computing and processing power has become fast, jumping from a few hundred megahertz (MHz) to gigahertz (GHz). The underlying processing hardware is nowadays built to support parallelism, or in other words, to perform multiple processes and tasks concurrently. Serial or sequential computing denotes a method of writing algorithms as a discrete series of instructions that are executed sequentially, one after another, on a single processor. With the advent of recent processor technology and high-performance computing, parallel computing has become the dominant way of solving real-world problems. Parallel computing is a method of dividing the solution of a problem into dedicated sets of instructions such that each set (i.e. sub-program) is executed concurrently on a processing unit [1].
In this regard, one can think of a computational problem as being broken apart into discrete pieces of work that can be solved simultaneously by executing multiple program instructions at any moment in time. The results obtained in parallel can be gathered to form

International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 10, October 2016, https://sites.google.com/site/ijcsis/ ISSN 1947-5500
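To make this concrete, the following is a minimal sketch (not the paper's actual code) of data-parallel word-unigram tokenisation in Python, using the standard multiprocessing.Pool so that data distribution across worker processes is handled automatically, in the spirit described above. The function names and the toy corpus are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: each document is an independent piece of work, so a
# corpus can be partitioned across worker processes with no shared state.
from multiprocessing import Pool

def tokenise(document):
    """Word-unigram tokenisation: split a document into lowercase words."""
    return document.lower().split()

def tokenise_corpus_sequential(corpus):
    # Baseline: process documents one after another on a single core.
    return [tokenise(doc) for doc in corpus]

def tokenise_corpus_parallel(corpus, workers=4):
    # Pool.map distributes the documents across worker processes and
    # gathers the per-document results back in order.
    with Pool(processes=workers) as pool:
        return pool.map(tokenise, corpus)

if __name__ == "__main__":
    corpus = ["Parallel computing divides work across cores",
              "NLP pipelines process big text collections"] * 1000
    # Both versions must produce identical token lists.
    assert tokenise_corpus_parallel(corpus) == tokenise_corpus_sequential(corpus)
```

Whether the parallel version is faster depends on per-document work versus the overhead of creating and feeding the worker processes, which is exactly the trade-off the abstract reports for the cheaper tokenisation schemes.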