Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

Nikola Rajovic §‡, Paul M. Carpenter §, Isaac Gelado §, Nikola Puzovic §, Alex Ramirez §‡, Mateo Valero §‡
§ Barcelona Supercomputing Center, C/ Jordi Girona 29, 08034 Barcelona, Spain
‡ Universitat Politècnica de Catalunya, C/ Jordi Girona 1-3, 08034 Barcelona, Spain
{first.last}@bsc.es

ABSTRACT
In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in high-performance computing. This transformation has been so effective that the June 2013 TOP500 list is still dominated by x86. In 2013, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based SoCs. This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This paper addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. We also present our experience evaluating the performance and efficiency of mobile SoCs, deploying a cluster, and evaluating the network and scalability of production applications. In summary, we give a first answer as to whether mobile SoCs are ready for HPC.

1. INTRODUCTION
During the early 1990s, the supercomputing landscape was dominated by special-purpose vector and SIMD¹ architectures. Vendors such as Cray (vector, 41%), MasPar (SIMD, 11%), and Convex/HP (vector, 5%²) designed and built their own HPC computer architectures for maximum performance on HPC applications. During the mid to late 1990s, microprocessors used in the workstations of the day, such as DEC Alpha, SPARC, and MIPS, began to take over high-performance computing. About ten years later, these RISC CPUs were, in turn, displaced by the x86 CISC architecture used in commodity PCs.
Figure 1 shows how the number of systems of each of these types has evolved since the first publication of the TOP500 list in 1993 [41].

¹ SIMD: Single-Instruction, Multiple Data.
² Figures are vendor system share in the June 1993 TOP500 [41].

SC '13, November 17-21, 2013, Denver, CO, USA. Copyright 2013 ACM 978-1-4503-2378-9/13/11. http://dx.doi.org/10.1145/2503210.2503281

[Figure 1: TOP500: Special-purpose HPC replaced by RISC microprocessors, in turn displaced by x86.]

Building an HPC chip is very expensive in terms of research, design, verification, and photomask creation. This cost needs to be amortized over as many units as possible to minimize the final price per chip. This is the reason for the trend in Figure 1: the highest-volume commodity market, which was until the mid-2000s the desktop market, tends to drive lower-volume, higher-performance markets such as servers and HPC.

The above argument requires, of course, that lower-end commodity parts are able to attain a sufficient level of performance, connectivity, and reliability. To shed some light on the timing of transitions in the HPC world, we look at the levels of CPU performance during the move from vector processors to commodity microprocessors.
Figure 2(a) shows the peak floating-point performance of HPC-class vector processors from Cray and NEC, compared with floating-point-capable commodity microprocessors. The chart shows that, in the period 1990 to 2000, as the transition in HPC from vector processors to microprocessors gathered pace, commodity microprocessors targeted at personal computers, workstations, and servers were around ten times slower at floating-point math than vector processors.

The lower per-processor performance meant that an application had to exploit ten parallel microprocessors to achieve the performance of a single vector CPU, and this required new programming techniques, including message-passing pro-