International Journal of Research in Advent Technology (E-ISSN: 2321-9637)
Special Issue National Conference “NCPCI-2016”, 19 March 2016
Available online at www.ijrat.org

Introduction to Real-Time Processing in Apache Apex

Harsh Pathak 1, Manas Rathi 2, Aniket Parekh 3
Third Year Students 1,2,3, Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune, Maharashtra, India.
harshnpathak@gmail.com 1, manas.rathi@outlook.com 2, someshparekh@gmail.com 3

Abstract- With the advent of the 21st century, data across the World Wide Web came to be generated in huge quantities. Storing this data on physical devices became impractical, let alone processing it to obtain results. Data generated by social networks, wireless sensor networks and other big-data sources made it even more difficult to process data instantaneously. The concept of stream processing therefore emerged, offering an advantage over batch processing. Apache Apex facilitates real-time processing of unbounded streams of data, efficiently increasing the throughput of the output stream. Apache Apex is a YARN-native platform for big data in motion that combines stream and batch processing. Big data processing is done in a highly scalable, secure and fault-tolerant way, with high performance and simplicity. In this paper, we provide a case study of Apache Apex, its functionalities, and its extensibility in relation to real-world use cases. By stating future applications of the platform, we justify the need for it.

Index Terms- Apache Apex, Big Data, Hadoop, Stream Processing, Windowing, YARN.

1. INTRODUCTION

Handling big data together with real-time processing is a necessity today. One of the best-known big data platforms is Hadoop, which concentrates mainly on operations over big data. It not only allows storage and processing of big data but also does so in a distributed fashion across large clusters of computers.
Being an open source framework, it is designed to scale from a single node to a large number of computers, each with its own RAM and storage. [5]

Apache Apex includes key features requested by the open source developer community that are not available in current open source technologies:
(1) Event processing guarantees
(2) In-memory performance and scalability
(3) Fault tolerance and state management
(4) Native rolling and tumbling window support
(5) Hadoop-native YARN and HDFS implementation

Figure 1 shows the overall architecture of Apache Apex. Apex is a YARN-native platform which facilitates real-time stream processing. A REST API can be integrated with real-world applications. The architecture consists of the following layers:
(1) Physical, Virtual, Cloud: Sources of the input data set.
(2) Hadoop: Comprises YARN and HDFS, forming the basis of streaming applications.
(3) Streaming Runtime: In-memory processing of data in motion with windowing.
(4) Malhar: Library of open source operators.
(5) Streaming Applications: Include the business logic, expressed as a DAG (Directed Acyclic Graph).
(6) User Interface: For user interaction. Involves Launch, Dashboard and Console.

Fig. 1: Architectural framework of Apex. [6]
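
The difference between the tumbling and rolling windows listed among the key features can be illustrated with a minimal sketch. It is written in plain Python and does not use the Apex API. The function names `tumbling_windows` and `rolling_windows`, the sample stream, and the use of count-based windows are assumptions made only for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def tumbling_windows(events: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of `size` events."""
    window: List[int] = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield window
            window = []  # start a fresh window; no event is shared
    if window:
        # Emit the final, partially filled window.
        yield window


def rolling_windows(events: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield overlapping windows, each covering the last `size` events."""
    window: Deque[int] = deque(maxlen=size)  # oldest event drops out automatically
    for event in events:
        window.append(event)
        if len(window) == size:
            yield list(window)


if __name__ == "__main__":
    # Hypothetical stream of numeric events (an assumption, not data from the paper).
    stream = [1, 2, 3, 4, 5]
    print([sum(w) for w in tumbling_windows(stream, 2)])
    print([sum(w) for w in rolling_windows(stream, 2)])
```

In the tumbling case each event contributes to exactly one aggregate, whereas in the rolling case consecutive aggregates share events, which is why the rolling variant emits more results for the same stream.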