Research Article

Accelerating the HyperLogLog Cardinality Estimation Algorithm

Cem Bozkus¹ and Basilio B. Fraguela²

¹ Bilkent Üniversitesi, Ankara, Turkey
² Universidade da Coruña, A Coruña, Spain

Correspondence should be addressed to Basilio B. Fraguela; basilio.fraguela@udc.es

Received 29 June 2017; Accepted 6 August 2017; Published 14 September 2017

Academic Editor: Piotr Luszczek

Copyright © 2017 Cem Bozkus and Basilio B. Fraguela. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In recent years, vast amounts of data of different kinds, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night, are being generated. This has led to new big data problems, which require new algorithms that can handle these large volumes of data and are, as a result, very computationally demanding. In this paper, we parallelize one of these new algorithms, namely, the HyperLogLog algorithm, which estimates the number of different items in a large data set with minimal memory usage, as it lowers the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them on a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data.

1. Introduction

Very often the processing of very large data sets does not require accurate solutions, it being enough to find approximate ones that can be achieved much more efficiently.
This strategy, called approximate computing, has been used in computing for many years and can be applied in those contexts where answers that are close enough to the actual value are acceptable, giving place to a trade-off of accuracy for other resources, typically memory space and time. For example, the stochastic gradient descent algorithm of machine learning is used to find an approximate local minimum rather than the exact global minimum. Another example is Bloom filters [1], which allow easily checking whether an item is in a set or not by using multiple hash functions, there being a certain probability of false positives, that is, of classifying as members of the set items that actually do not belong to it.

HyperLogLog (HLL) [2] is a very powerful approximate algorithm in the sense that it can practically and efficiently give a good estimation of the cardinality of a data set, meaning the number of different items in it with respect to some characteristics. This value has many real-life applications, making its computation a must for high-profile companies working in big data. This algorithm is used by basically anyone who has a data set and needs its cardinality without wasting space, as it can calculate the cardinality of a data set with O(1) space complexity. Because of the way it is implemented, HLL allows many smaller-budget companies and individuals without large memory banks to work with large data sets.

In this paper we develop two parallel implementations of the HyperLogLog algorithm, one of them based on OpenMP and targeted to multicore processors and Intel Xeon Phi accelerators and another one based on OpenCL, which can be run not only on these systems but also on other kinds of accelerators such as GPUs. Both implementations are compared on an Intel Xeon Phi and a standard multicore system, while the performance of the OpenCL version is evaluated on these platforms as well as on an NVIDIA Tesla K20m GPU and an AMD FirePro S9150 GPU.
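To make the estimator concrete, and to hint at why it lends itself to parallelization, the following is a minimal Python sketch of HyperLogLog. The function names, the SHA-1-derived 64-bit hash, and the choice of b = 14 register bits are illustrative assumptions of ours, not the paper's implementation, and the small- and large-range corrections of the full algorithm are omitted:

```python
import hashlib


def hll_registers(items, b=14):
    """Build the 2^b HyperLogLog registers for an iterable of items."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item; HLL assumes a uniform hash function.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)                    # low b bits select a register
        w = h >> b                           # remaining 64 - b bits
        rho = (64 - b) - w.bit_length() + 1  # rank of the leftmost 1-bit in w
        if rho > registers[idx]:
            registers[idx] = rho
    return registers


def hll_estimate(registers):
    """Raw HLL estimate (small/large-range corrections omitted for brevity)."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)


def hll_merge(a, b):
    """Registers built from disjoint chunks combine by element-wise max."""
    return [max(x, y) for x, y in zip(a, b)]
```

The `hll_merge` function illustrates the property exploited by data-parallel implementations: each thread or work-item can build registers for its own chunk of the input, and the per-register maxima combine associatively into the same registers a sequential pass would produce.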
The remainder of this paper is organized as follows. Section 2 briefly summarizes the HyperLogLog algorithm and its sequential implementation, while Section 3 details our parallel implementations. This is followed by an evaluation in Section 4. Then, Section 5 is devoted to the related work. Finally, Section 6 presents our concluding ideas.

Hindawi Scientific Programming, Volume 2017, Article ID 2040865, 8 pages, https://doi.org/10.1155/2017/2040865