Chapter 2
Getting Started with Hadoop

Apache Hadoop is a software framework that allows distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of nodes. Rather than relying on hardware for high availability, Hadoop is designed to detect and handle failures at the application level, thereby delivering a highly available service on top of a cluster of commodity hardware nodes, each of which is prone to failure [2].

While Hadoop can run on a single machine, its true power lies in its ability to scale up to thousands of computers, each with several processor cores, and to distribute large amounts of work across the cluster efficiently [1]. The lower end of Hadoop scale is probably in the hundreds of gigabytes, as it was designed to handle web-scale data on the order of terabytes to petabytes. At this scale a dataset will not fit on a single computer's hard drive, much less in memory. Hadoop's distributed file system breaks the data into chunks and distributes them across several computers; computations then run in parallel on all of these chunks, producing results as efficiently as possible. (Two short sketches at the end of this section illustrate the storage and processing sides of this design.)

The Internet age has passed and we are now in the data age. The amount of data stored electronically cannot be measured easily; IDC estimates put the total size of the digital universe at 0.18 zettabytes in 2006, and expect it to grow tenfold by 2011, to 1.8 zettabytes [9]. A zettabyte is 10^21 bytes, or equivalently 1,000 exabytes, 1,000,000 petabytes, or one billion terabytes. This is roughly equivalent to one disk drive for every person in the world [10].

This flood of data comes from many sources. Consider the following:

• The New York Stock Exchange generates about one terabyte of trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
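To make the chunk-and-distribute idea concrete, here is a minimal sketch using Hadoop's Java FileSystem API. It is not the chapter's own example: the class name HdfsRoundTrip and the path /tmp/hello.txt are illustrative, and the code assumes a Hadoop client configuration (a core-site.xml pointing at a NameNode) is available on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from core-site.xml on the classpath;
        // on a real cluster this points at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // illustrative path

        // Write: the client sees a single logical stream, while HDFS
        // splits the data into blocks and replicates each block
        // across several DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read the file back as one logical stream.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        // Block size and replication factor are per-file metadata.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("block size = " + status.getBlockSize()
                + ", replication = " + status.getReplication());
    }
}

Whether the file is a few bytes or several terabytes, the client code is identical; HDFS handles block placement and replication transparently.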
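The parallel computation over those chunks is expressed in Hadoop with the MapReduce programming model. As a preview, the following sketch follows the canonical word-count job from the Apache Hadoop documentation: the framework runs one map task per chunk of input, in parallel across the cluster, and the reducers aggregate the per-word counts.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mappers run in parallel, one task per input split (chunk),
    // emitting (word, 1) for every token they see.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducers receive all counts for a given word and sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate per mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same job runs unchanged on a laptop or on a thousand-node cluster; only the number of parallel map and reduce tasks differs.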