Ubiq: A Scalable and Fault-tolerant Log Processing Infrastructure

Venkatesh Basker, Manish Bhatia, Vinny Ganeshan, Ashish Gupta, Shan He, Scott Holzer, Haifeng Jiang, Monica Chawathe Lenart, Navin Melville, Tianhao Qiu, Namit Sikka, Manpreet Singh, Alexander Smolyanov, Yuri Vasilevski, Shivakumar Venkataraman, and Divyakant Agrawal

Google Inc.

Abstract. Most of today’s Internet applications generate vast amounts of data (typically, in the form of event logs) that needs to be processed and analyzed for detailed reporting, enhancing user experience and increasing monetization. In this paper, we describe the architecture of Ubiq, a geographically distributed framework for processing continuously growing log files in real time with high scalability, high availability and low latency. The Ubiq framework fully tolerates infrastructure degradation and data center-level outages without any manual intervention. It also guarantees exactly-once semantics for application pipelines to process logs as a collection of multiple events. Ubiq has been in production for Google’s advertising system for many years and has served as a critical log processing framework for several dozen pipelines. Our production deployment demonstrates linear scalability with machine resources, extremely high availability even with underlying infrastructure failures, and an end-to-end latency of under a minute.

Key words: Stream processing, Continuous streams, Log processing, Distributed systems, Multi-homing, Fault tolerance, Distributed Consensus Protocol, Geo-replication

1 Introduction

Most of today’s Internet applications are data-centric: they are driven by back-end database infrastructure to deliver the product to their users. At the same time, users interacting with these applications generate vast amounts of data that need to be processed and analyzed for detailed reporting, enhancing the user experience and increasing monetization.
In addition, most of these applications are network-enabled, accessed by users anywhere in the world at any time. The consequence of this ubiquity of access is that user-generated data flows continuously, in what is referred to as a data stream. In the context of an application, the data stream is a sequence of events that effectively represents the history of users’ interactions with the application. The data is stored as a large number of files, collectively referred to as an input log (or multiple input logs if the application demands it, e.g., separate query and click logs for a search application). The log