Big Data Streaming Platforms to Support Real-time Analytics

Eliana Fernandes [1], Ana Carolina Salgado [2, a] and Jorge Bernardino [1, 3, b]

[1] Polytechnic of Coimbra – ISEC, Rua Pedro Nunes, Quinta da Nora, 3030-199 Coimbra, Portugal
[2] Centre for Informatics, Universidade Federal de Pernambuco, Recife, Brazil
[3] Centre for Informatics and Systems of the University of Coimbra (CISUC), Portugal

Keywords: Streaming, Real-time Analytics, Big Data, Fault-Tolerance.

Abstract: In recent years, data volumes have grown exponentially as technology has evolved. Data flows quickly and continuously, so it must be processed in real time. Several big data streaming platforms have therefore emerged for processing large amounts of data. Companies now find it difficult to choose the platform that best suits their needs, and the choice is made harder because information about the platforms is scattered and sometimes incomplete. This work aims to help companies and organizations choose a big data streaming platform to analyze and process their data flows. We describe the most popular platforms: Apache Flink, Apache Kafka, Apache Samza, Apache Spark and Apache Storm. To deepen the understanding of these platforms, we also discuss their architectures, advantages and limitations. Finally, we compare the platforms, using as attributes the characteristics that companies most often need.

1 INTRODUCTION

The explosive growth of the Internet has caused large amounts of data to be generated. Companies try to keep pace with this growth, which requires processing data efficiently and at the speed at which it is produced (Safaei, 2017). Big data is a generic term for organizing, processing, and aggregating large amounts of data. Data that changes quickly and continuously is called streaming data (Behera et al., 2018). It needs to be analyzed within a short period of time.
Traditional Business Intelligence tools aren't suitable for analyzing streaming data in real time, because they process data in batches (Behera et al., 2018). A large number of big data streaming platforms have been developed (Imanuel, 2019). Choosing among these platforms is a major challenge for most companies, because their requirements sometimes differ from the features that the platforms offer.

The objective of this work is to assist in choosing a big data streaming platform, taking into account the characteristics that companies look for in a platform. To that end, we describe and compare the most popular open-source big data streaming platforms: Flink, Kafka, Samza, Spark and Storm (Imanuel, 2019).

The rest of this paper is structured as follows. Section 2 provides an overview of the big data streaming platforms, their architectures, advantages and limitations. Section 3 presents a comparative study of these platforms. The conclusions and future work are presented in Section 4.

2 STREAMING PLATFORMS

Processing data means manipulating and aggregating it in order to transform it into useful information. In big data stream processing, results are always up to date: as soon as data becomes available, it is processed and transformed into information.

To ensure continuous and stable operation of the entire system, the platform must have a suitable architecture design. Big data streaming platforms can follow a symmetrical architecture or a master-slave architecture. In a symmetrical architecture, all nodes perform the same functions, which gives good scalability.

[a] https://orcid.org/0000-0003-4036-8064
[b] https://orcid.org/0000-0001-9660-2011

Fernandes, E., Salgado, A. and Bernardino, J. Big Data Streaming Platforms to Support Real-time Analytics.
DOI: 10.5220/0009817304260433
In Proceedings of the 15th International Conference on Software Technologies (ICSOFT 2020), pages 426-433
ISBN: 978-989-758-443-5
Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
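
The difference between batch processing and the immediate, always up-to-date processing described in Section 2 can be illustrated with a minimal Python sketch. It is not taken from any of the platforms discussed. The names `RunningMean`, `batch_mean` and `stream_means` are hypothetical and introduced only for illustration, and the running mean stands in for whatever aggregation a real platform would perform.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class RunningMean:
    """Hypothetical operator state: a mean that is updated once per event."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> float:
        # Fold the new event into the state and return the current mean.
        self.count += 1
        self.total += value
        return self.total / self.count


def batch_mean(values: List[float]) -> float:
    """Batch style: wait for the complete data set, then compute once."""
    return sum(values) / len(values)


def stream_means(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an up-to-date result as each event arrives."""
    state = RunningMean()
    for value in events:
        yield state.update(value)


if __name__ == "__main__":
    readings = [10.0, 20.0, 60.0]
    # One result, available only after all readings have been collected.
    print(batch_mean(readings))
    # One result per reading, each reflecting the data seen so far.
    print(list(stream_means(readings)))
```

In the batch function the answer exists only once all data has been collected, whereas the streaming generator yields a usable result after every event. It keeps only a small piece of state rather than the full history.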